Just for fun, I have been trying to see what I could achieve with a simple attempt using randomForest. For those in the competition, this random forest got me a score of 0.74163, placing me 2781st out of 3064 entries. An acceptable spot, since this is an off-the-shelf approach. An improvement may follow in a subsequent post.
Data
The data is downloaded from Kaggle. It is real-world data, hence it has the odd missing value (in passenger age) and a number of columns with messy data, which might be employed to create additional variables. For the purpose of validation, about 90% of the data gets flagged as training set; test will be the test set, the results of which are passed back to Kaggle.
  PassengerId Survived Pclass Name
1 1 0 3 Braund, Mr. Owen Harris
2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
3 3 1 3 Heikkinen, Miss. Laina
4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
5 5 0 3 Allen, Mr. William Henry
6 6 0 3 Moran, Mr. James
Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 male 22 1 0 A/5 21171 7.2500 S
2 female 38 1 0 PC 17599 71.2833 C85 C
3 female 26 0 0 STON/O2. 3101282 7.9250 S
4 female 35 1 0 113803 53.1000 C123 S
5 male 35 0 0 373450 8.0500 S
6 male NA 0 0 330877 8.4583 Q
library(dplyr)
library(randomForest)
library(lattice)
options(width=85)
head(read.csv('train.csv'))

titanic <- read.csv('train.csv') %>%
  mutate(.,
         Pclass=factor(Pclass),
         Survived=factor(Survived),
         # impute missing ages as 35, then cut into age categories
         age=ifelse(is.na(Age),35,Age),
         age=cut(age,c(0,2,5,9,12,15,21,55,65,100)),
         # indicators for cabins on decks A to D
         A=grepl('A',Cabin),
         B=grepl('B',Cabin),
         C=grepl('C',Cabin),
         D=grepl('D',Cabin),
         # cabin number and odd/even indicator; -1 when unknown
         cn=as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
         oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
         # flag roughly 90% of the rows as training data
         train=sample(c(TRUE,FALSE),
                      size=891,
                      replace=TRUE,
                      prob=c(.9,.1)))
test <- read.csv('test.csv') %>%
  mutate(.,
         Pclass=factor(Pclass),
         age=ifelse(is.na(Age),35,Age),
         age=cut(age,c(0,2,5,9,12,15,21,55,65,100)),
         A=grepl('A',Cabin),
         B=grepl('B',Cabin),
         C=grepl('C',Cabin),
         D=grepl('D',Cabin),
         cn=as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
         oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
         # align the factor levels with the training data
         Embarked=factor(Embarked,levels=levels(titanic$Embarked)))
# impute missing Fare in the test set with the training median
test$Fare[is.na(test$Fare)] <- median(titanic$Fare)
Age has missing values, hence the first step is to fill those in. In the code above an age factor has been made, where missing values are imputed as 35 and thereby end up in the largest category.
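As a quick check (a minimal sketch using the titanic and test data frames created above), one can confirm that the imputed value of 35 falls in the largest age bin and see which columns still contain missing values:
# count passengers per age bin; unknown ages were imputed as 35 and therefore
# end up in the (21,55] bin, which holds most passengers
table(titanic$age)
# remaining missing values per column (cn stays NA when no cabin is known)
sapply(titanic, function(x) sum(is.na(x)))
sapply(test, function(x) sum(is.na(x)))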
Model building
A simple prediction using randomForest.
rf1 <- randomForest(Survived ~
                      Sex + Pclass + SibSp +
                      Parch + Fare +
                      Embarked + age +
                      A + B + C + D + oe,
                    data=titanic,
                    subset=train,
                    replace=FALSE,
                    ntree=1000)
Call:
randomForest(formula = Survived ~ Sex + Pclass + SibSp + Parch + Fare + Embarked + age + A + B + C + D + oe, data = titanic, replace = FALSE, ntree = 1000, subset = train)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 3
OOB estimate of error rate: 16.94%
Confusion matrix:
0 1 class.error
0 454 40 0.08097166
1 95 208 0.31353135
titanic$pred <- predict(rf1,titanic)
# misclassification rate on the hold-out (non-training) rows
with(titanic[!titanic$train,],sum(pred!=Survived)/length(pred))
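Beyond the single error rate, a confusion matrix on the hold-out rows and the variable importance of the fitted forest give a bit more detail; a minimal sketch using the rf1 object and the pred column from above:
# confusion matrix on the roughly 10% hold-out rows
with(titanic[!titanic$train,], table(observed=Survived, predicted=pred))
# variables the forest relies on most (mean decrease in Gini)
importance(rf1)
varImpPlot(rf1)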
# grid search over nodesize, mtry and the class weight for Survived==1
mygrid <- expand.grid(nodesize=c(2,4,6),
                      mtry=2:5,
                      wt=seq(.5,.7,.05))
sa <- sapply(1:nrow(mygrid), function(i) {
  rfx <- randomForest(Survived ~
                        Sex + Pclass + SibSp +
                        Parch + Fare +
                        Embarked + age +
                        A + B + C + D + oe,
                      data=titanic,
                      subset=train,
                      replace=TRUE,
                      ntree=4000,
                      nodesize=mygrid$nodesize[i],
                      mtry=mygrid$mtry[i],
                      classwt=c(1-mygrid$wt[i],mygrid$wt[i]))
  # proportion of wrong predictions on the hold-out rows
  preds <- predict(rfx,titanic[!titanic$train,])
  nwrong <- sum(preds!=titanic$Survived[!titanic$train])
  c(nodesize=mygrid$nodesize[i],mtry=mygrid$mtry[i],
    wt=mygrid$wt[i],pw=nwrong/length(preds))
})
tsa <- as.data.frame(t(sa))
xyplot(pw ~ wt | mtry,groups=factor(nodesize),data=tsa,
       auto.key=TRUE,type='l')
What is less visible from this plot is the amount of variation I had in the results. One prediction more or fewer correct already makes a visible difference in the figure. This is the reason I increased the number of trees in the model.
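The best settings used below can be read off from tsa; a minimal sketch, keeping in mind that the run-to-run variation just mentioned makes this ranking only a rough indication:
# settings with the lowest hold-out misclassification rate
tsa[order(tsa$pw), ][1:5, ]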
Final predictions
Using the best settings from above still places you near the bottom of the ranking. The script below fits the model and writes the predictions in the format required by Kaggle.
rf2 <- randomForest(Survived ~ Sex + Pclass + SibSp +
                      Parch + Fare +
                      Embarked + age +
                      A + B + C + D + oe,
                    data=titanic,
                    replace=TRUE,
                    ntree=5000,
                    nodesize=4,
                    mtry=3,
                    classwt=c(1-.6,.6))
pp <- predict(rf2,test)
# submission in the two-column format Kaggle expects
out <- data.frame(
  PassengerId=test$PassengerId,
  Survived=pp,
  row.names=NULL)
write.csv(x=out,
          file='rf1.csv',
          row.names=FALSE,
          quote=FALSE)
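As a last check (a small sketch, nothing more), the submission file can be read back in to confirm it contains the PassengerId and Survived columns in the expected layout:
# the written file should hold exactly two columns: PassengerId and Survived
head(read.csv('rf1.csv'))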
Thanks for posting your solutions to this competition.
I tried to run your code "as is" but had some problems. I was able to get models that roughly correspond to your results by doing the following:
1) I made "Sex" a factor in the data frames "titanic" and "test"
2) I made "Embarked" a factor in data frame "titanic"
3) For titanic$Embarked I changed the levels from "" "C" "Q" "S" to "X" "C" "Q" "S" (there were two instances of empty strings)
You might want to consider using "set.seed()" so others following your posts can reproduce your results. (I realize this will probably only work with similar operating systems, but it is still a best practice.)
Hope this helps others following along.
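For readers on a recent R release, where read.csv() returns character rather than factor columns, the fixes described in this comment could be applied right after reading the files, roughly like the sketch below; the data frame names and the seed value are placeholders, not part of the original post.
# sketch of the commenter's fixes, assuming Sex and Embarked come in as
# character columns; run right after read.csv, before the mutate steps
train_raw <- read.csv('train.csv')
test_raw  <- read.csv('test.csv')
train_raw$Sex <- factor(train_raw$Sex)
test_raw$Sex  <- factor(test_raw$Sex, levels=levels(train_raw$Sex))
# the two empty Embarked strings become level 'X'
train_raw$Embarked[train_raw$Embarked==''] <- 'X'
train_raw$Embarked <- factor(train_raw$Embarked)
test_raw$Embarked  <- factor(test_raw$Embarked, levels=levels(train_raw$Embarked))
set.seed(42)  # arbitrary seed so the sampling and forests are reproducible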
There is something in this dplyr code which gives strange and non-reproducible errors. That might be what caused some of the effects you describe. My current approach is dplyr-free.
Regarding the empty string in Embarked, that is a valid factor level. But to get predictions the empty string needs to be added in the test set too.
I did one set.seed for a small test set, but to get reproducible results I should probably set it before the start of any model estimation. In due course more models were created than are shown in the blog. How set.seed will retain its property if the work is distributed over several cores is unclear to me. So, I might refrain from this approach.
Note that it is my intention to create code that can easily be run by the reader. Any comments towards that goal are welcome.
About the entries with perfect predictions: I find it hard to believe that the moment the ship struck the iceberg the fate of every single person on board was completely determined by the few things about them that are listed on the passenger list. It's more likely that some people looked up the full outcome somewhere and created the perfect submission from the complete data.
Thanks for the examples, I had to do some slight fiddling to get it working but the idea is clear and useful!
I have my doubts about those perfect predictions too. But I do find it a nice data set to try methods with. And for that I am happy that this proves useful.