## Sunday, July 19, 2015

### Predicting Titanic deaths on Kaggle

Kaggle has a competition to predict who will die on the famous Titanic 'Machine Learning from Disaster''. It is placed as knowledge competition. Just up there to learn. I am late to the party, it has been been for 1 1/2 year, to end by end 2015. It is a small data set, hence interesting to learn from. It is also a competition with a number of entries which have perfect predictions.
Just for fun, I have been trying to see what I would achieve with simple attempt with randomforest. For those in the competition, this randomforest got me 0.74163, placing me 2781 out of 3064 entries. An acceptable spot, since this is using off the shelf approach. An improvement may follow in a subsequent post.

#### Data

Data downloaded from Kaggle. It is real world data, hence has the odd missing (in passenger age) and a number of columns with messy data, which might be employed to create additional variables. For the purpose of validation about 90% of the data gets flagged to be training set. test will be the test, set, results of which to be passed back to Kaggle.
PassengerId Survived Pclass                                                Name
1           1        0      3                             Braund, Mr. Owen Harris
2           2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
3           3        1      3                              Heikkinen, Miss. Laina
4           4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel)
5           5        0      3                            Allen, Mr. William Henry
6           6        0      3                                    Moran, Mr. James
Sex Age SibSp Parch           Ticket    Fare Cabin Embarked
1   male  22     1     0        A/5 21171  7.2500              S
2 female  38     1     0         PC 17599 71.2833   C85        C
3 female  26     0     0 STON/O2. 3101282  7.9250              S
4 female  35     1     0           113803 53.1000  C123        S
5   male  35     0     0           373450  8.0500              S
6   male  NA     0     0           330877  8.4583              Q
library(dplyr)
library(randomForest)
library(lattice)
options(width=85)

mutate(.,Pclass=factor(Pclass),
Survived=factor(Survived),
age=ifelse(is.na(Age),35,Age),
age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
A=grepl('A',Cabin),
B=grepl('B',Cabin),
C=grepl('C',Cabin),
D=grepl('D',Cabin),
cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
train = sample(c(TRUE,FALSE),
size=891,
replace=TRUE,
prob=c(.9,.1)   ) )
mutate(.,Pclass=factor(Pclass),
age=ifelse(is.na(Age),35,Age),
age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
A=grepl('A',Cabin),
B=grepl('B',Cabin),
C=grepl('C',Cabin),
D=grepl('D',Cabin),
cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
Embarked=factor(Embarked,levels=levels(titanic\$Embarked))
)
test\$Fare[is.na(test\$Fare)]<- median(titanic\$Fare)
Age has missing values, hence the first step is to fill those in. In the code above, an age factor has been made, where missings are imputed the largest category.

#### Model building

A simple prediction using randomForest.
rf1 <- randomForest(Survived ~
Sex+Pclass + SibSp +
Parch + Fare +
Embarked + age +
A+B+C+D +oe,
data=titanic,
subset=train,
replace=FALSE,
ntree=1000)
Call:
randomForest(formula = Survived ~ Sex + Pclass + SibSp + Parch +      Fare + Embarked + age + A + B + C + D + oe, data = titanic,      replace = FALSE, ntree = 1000, subset = train)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 3

OOB estimate of  error rate: 16.94%
Confusion matrix:
0   1 class.error
0 454  40  0.08097166
1  95 208  0.31353135

This shows some bias in the predictions. Class one gets predicted as class 0 far too often. Hence I will optimize not only the normal variables nodesize (Minimum size of terminal nodes) and mtry (Number of variables randomly sampled as candidates at each split) but also classwt (Priors of the classes).

titanic\$pred <- predict(rf1,titanic)
with(titanic[!titanic\$train,],sum(pred!=Survived)/length(pred))

mygrid <- expand.grid(nodesize=c(2,4,6),
mtry=2:5,
wt=seq(.5,.7,.05))
sa <- sapply(1:nrow(mygrid), function(i) {
rfx <- randomForest(Survived ~
Sex+Pclass + SibSp +
Parch + Fare +
Embarked + age +
A+B+C+D +oe,
data=titanic,
subset=train,
replace=TRUE,
ntree=4000,
nodesize=mygrid\$nodesize[i],
mtry=mygrid\$mtry[i],
classwt=c(1-mygrid\$wt[i],mygrid\$wt[i]))
preds <- predict(rfx,titanic[!titanic\$train,])
nwrong <- sum(preds!=titanic\$Survived[!titanic\$train])
c(nodesize=mygrid\$nodesize[i],mtry=mygrid\$mtry[i],wt=mygrid\$wt[i],pw=nwrong/length(preds))
})
tsa <- as.data.frame(t(sa))
xyplot(pw ~ wt | mtry,group=factor(nodesize), data=tsa,auto.key=TRUE,type='l')

What is less visible from this plot is the amount of variation I had in the results. One prediction better or worse really makes a difference in the figure. This is the reason I have increased the number of trees in the model.

#### Final predictions

Using the best settings from above, gets you to the bottom of the ranking. The script makes the model, writes predictions in the format required by kaggle.
rf2 <- randomForest(Survived ~
Sex+Pclass + SibSp +
Parch + Fare +
Embarked + age +
A+B+C+D +oe,
data=titanic,
replace=TRUE,
ntree=5000,
nodesize=4,
mtry=3,
classwt=c(1-.6,.6))

pp <- predict(rf2,test)
out <- data.frame(
PassengerId=test\$PassengerId,
Survived=pp,row.names=NULL)
write.csv(x=out,
file='rf1.csv',
row.names=FALSE,
quote=FALSE)

1. Thanks for posting your solutions to this competition.

I tried to run your code "as is" but had some problems. I was able to get models that roughly correspond to your results by doing the following:

1) I made "Sex" a factor in the data frames "titanic" and "test"
2) I made "Embarked" a factor in data frame "titanic"
3) For titanic\$Embarked I changed the levels from "" "C" "Q" "S" to "X" "C" "Q" "S" (there were two instances of empty strings)

You might want to consider using "set.seed()" so others following your posts can reproduce your results. (I realize this will probably only work with similar operating systems, but it is still a best practice.)

Hope this helps others following along.

1. There is something in this dplyr code which gives strange and not reproducible errors. That might be what caused some of these effects you describe. My current approach is dplyr free.
Regarding the empty string in Embarked, that is a valid factor level. But to get predictions the empty string needs to be added in test set too.
I did one set.seed for a small test set, but to get reproducible I probably should set it before the start for any model estimation. In due course more models were created than shown in the blog. How set.seed will retain its property if the work is distributed over several cores is unclear to me. So, I might refrain from this approach.
Note that it is my intention to create code that can easily be run by the reader. Any comments to get that is welcome.

2. About the entries with perfect preditions: I find it hard to beleive that the moment that the ship struck the iceberg the fate of every single person on board was completely determined by the few things about them that are listed on the passeneger list. It's more likely that some persons looked up the total result somewhere and created the perfect submission from the complete data.

Thanks for the examples, I had to do some slight fiddeling to get it working but the idea is clear and useful!

1. I have my doubts about those perfect predictions too. But I do find it a nice set data to try methods with. And for that I am happy that this proves useful.