Wiekvoet: Predicting Titanic deaths on Kaggle

Kaggle has a competition to predict who will die on the famous Titanic 'Machine Learning from Disaster''. It is placed as knowledge competition. Just up there to learn. I am late to the party, it has been been for 1 1/2 year, to end by end 2015. It is a small data set, hence interesting to learn from. It is also a competition with a number of entries which have perfect predictions.
Just for fun, I have been trying to see what I would achieve with simple attempt with randomforest. For those in the competition, this randomforest got me 0.74163, placing me 2781 out of 3064 entries. An acceptable spot, since this is using off the shelf approach. An improvement may follow in a subsequent post.

Data

Data downloaded from Kaggle. It is real world data, hence has the odd missing (in passenger age) and a number of columns with messy data, which might be employed to create additional variables. For the purpose of validation about 90% of the data gets flagged to be training set. test will be the test, set, results of which to be passed back to Kaggle.
PassengerId Survived Pclass Name
1 1 0 3 Braund, Mr. Owen Harris
2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
3 3 1 3 Heikkinen, Miss. Laina
4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
5 5 0 3 Allen, Mr. William Henry
6 6 0 3 Moran, Mr. James
Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 male 22 1 0 A/5 21171 7.2500 S
2 female 38 1 0 PC 17599 71.2833 C85 C
3 female 26 0 0 STON/O2. 3101282 7.9250 S
4 female 35 1 0 113803 53.1000 C123 S
5 male 35 0 0 373450 8.0500 S
6 male NA 0 0 330877 8.4583 Q

library(dplyr)

library(randomForest)

library(lattice)

options(width=85)

head(read.csv('train.csv'))

titanic <- read.csv('train.csv') %>%

mutate(.,Pclass=factor(Pclass),

Survived=factor(Survived),

age=ifelse(is.na(Age),35,Age),

age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),

A=grepl('A',Cabin),

B=grepl('B',Cabin),

C=grepl('C',Cabin),

D=grepl('D',Cabin),

cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),

oe=factor(ifelse(!is.na(cn),cn%%2,-1)),

train = sample(c(TRUE,FALSE),

size=891,

replace=TRUE,

prob=c(.9,.1) ) )

test <- read.csv('test.csv') %>%

mutate(.,Pclass=factor(Pclass),

age=ifelse(is.na(Age),35,Age),

age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),

A=grepl('A',Cabin),

B=grepl('B',Cabin),

C=grepl('C',Cabin),

D=grepl('D',Cabin),

cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),

oe=factor(ifelse(!is.na(cn),cn%%2,-1)),

Embarked=factor(Embarked,levels=levels(titanic$Embarked))

)

test$Fare[is.na(test$Fare)]<- median(titanic$Fare)

Age has missing values, hence the first step is to fill those in. In the code above, an age factor has been made, where missings are imputed the largest category.

Model building

A simple prediction using randomForest.

rf1 <- randomForest(Survived ~

Sex+Pclass + SibSp +

Parch + Fare +

Embarked + age +

A+B+C+D +oe,

data=titanic,

subset=train,

replace=FALSE,

ntree=1000)

Call:

randomForest(formula = Survived ~ Sex + Pclass + SibSp + Parch + Fare + Embarked + age + A + B + C + D + oe, data = titanic, replace = FALSE, ntree = 1000, subset = train)

Type of random forest: classification

Number of trees: 1000

No. of variables tried at each split: 3

OOB estimate of error rate: 16.94%

Confusion matrix:

0 1 class.error

0 454 40 0.08097166

1 95 208 0.31353135

This shows some bias in the predictions. Class one gets predicted as class 0 far too often. Hence I will optimize not only the normal variables nodesize (Minimum size of terminal nodes) and mtry (Number of variables randomly sampled as candidates at each split) but also classwt (Priors of the classes).

titanic$pred <- predict(rf1,titanic)
with(titanic[!titanic$train,],sum(pred!=Survived)/length(pred))

mygrid <- expand.grid(nodesize=c(2,4,6),
mtry=2:5,
wt=seq(.5,.7,.05))
sa <- sapply(1:nrow(mygrid), function(i) {
rfx <- randomForest(Survived ~
Sex+Pclass + SibSp +
Parch + Fare +
Embarked + age +
A+B+C+D +oe,
data=titanic,
subset=train,
replace=TRUE,
ntree=4000,
nodesize=mygrid$nodesize[i],
mtry=mygrid$mtry[i],
classwt=c(1-mygrid$wt[i],mygrid$wt[i]))
preds <- predict(rfx,titanic[!titanic$train,])
nwrong <- sum(preds!=titanic$Survived[!titanic$train])
c(nodesize=mygrid$nodesize[i],mtry=mygrid$mtry[i],wt=mygrid$wt[i],pw=nwrong/length(preds))
})
tsa <- as.data.frame(t(sa))
xyplot(pw ~ wt | mtry,group=factor(nodesize), data=tsa,auto.key=TRUE,type='l')

What is less visible from this plot is the amount of variation I had in the results. One prediction better or worse really makes a difference in the figure. This is the reason I have increased the number of trees in the model.

Final predictions

Using the best settings from above, gets you to the bottom of the ranking. The script makes the model, writes predictions in the format required by kaggle.

rf2 <- randomForest(Survived ~
Sex+Pclass + SibSp +
Parch + Fare +
Embarked + age +
A+B+C+D +oe,
data=titanic,
replace=TRUE,
ntree=5000,
nodesize=4,
mtry=3,
classwt=c(1-.6,.6))

pp <- predict(rf2,test)
out <- data.frame(
PassengerId=test$PassengerId,
Survived=pp,row.names=NULL)
write.csv(x=out,
file='rf1.csv',
row.names=FALSE,
quote=FALSE)

4 comments:

Unknown10 August, 2015 00:49
Thanks for posting your solutions to this competition.

I tried to run your code "as is" but had some problems. I was able to get models that roughly correspond to your results by doing the following:

1) I made "Sex" a factor in the data frames "titanic" and "test"
2) I made "Embarked" a factor in data frame "titanic"
3) For titanic$Embarked I changed the levels from "" "C" "Q" "S" to "X" "C" "Q" "S" (there were two instances of empty strings)

You might want to consider using "set.seed()" so others following your posts can reproduce your results. (I realize this will probably only work with similar operating systems, but it is still a best practice.)

Hope this helps others following along.
Frank Inklaar20 August, 2015 16:43
About the entries with perfect preditions: I find it hard to beleive that the moment that the ship struck the iceberg the fate of every single person on board was completely determined by the few things about them that are listed on the passeneger list. It's more likely that some persons looked up the total result somewhere and created the perfect submission from the complete data.

Thanks for the examples, I had to do some slight fiddeling to get it working but the idea is clear and useful!

Sunday, July 19, 2015

Predicting Titanic deaths on Kaggle

Data

Model building

Final predictions

4 comments: