Wiekvoet: Predicting Titanic deaths on Kaggle III: Bagging

This is the third post on prediction the deaths. The first one used randomforest, the second boosting (gbm). The aim of the third post was to use bagging. In contrast to the former posts I abandoned dplyr in this post. It gave some now you see now you don't errors.

Data

The data is supposed to be the same as previous.
library(ipred)
library(rpart)
library(lattice)

# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 35
tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100))
tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
tt$A <- factor(grepl('A',tt$Cabin))
tt$B <- factor(grepl('B',tt$Cabin))
tt$C <- factor(grepl('C',tt$Cabin))
tt$D <- factor(grepl('D',tt$Cabin))
tt$E <- factor(grepl('E',tt$Cabin))
tt$F <- factor(grepl('F',tt$Cabin))
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$PC <- factor(grepl('PC',tt$Ticket))
tt$STON <- factor(grepl('STON',tt$Ticket))
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)

Age

The first step is again to predict the missing ages. Even though we have I have all data available in one data.frame, I still think the correct approach is to create the age model using only the training data. Note that I am not too impressed with the age model. Perhaps this should also be optimized.

forage <- tt[!is.na(tt$Age) & tt$status=='train',names(tt) %in%

c('Age','Sex','Pclass','SibSP',

'Parch','Fare','Title','Embarked','A','B','C','D','E','F',

'ncabin','PC','STON','oe')]

ipbag1 <- bagging(Age ~.,data=forage)

ipbag1

Bagging regression trees with 25 bootstrap replications

Call: bagging.data.frame(formula = Age ~ ., data = forage)

plot(tt$Age~predict(ipbag1,tt))

tt$AGE <- tt$Age

tt$AGE[is.na(tt$AGE)] <- predict(ipbag1,tt[is.na(tt$AGE),])

Selecting the survival model

ipred, the package in which bagging resides, comes with a nice general purpose cross validation utility. In the end, I decided the two parameters to be optimized are ns; the size of the bags and minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. Nbagg, the number of bootstrap evaluations, just needs to be big enough. Regarding nbagg, I have the feeling that this particular problem, with relatively few records, it may be needed to have relatively high nbagg in order to have reproducible models.

di1 <- subset(titanic,select=c(

age,SibSp,Parch,Fare,Sex,Pclass,

Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe,AGE,Survived))

dso <- expand.grid(ns=seq(100,300,25),nbagg=c(500),minsplit=1:6)

la <- lapply(1:nrow(dso),function(ii) {

ee <- errorest(Survived ~ .,

ns=dso$ns[ii],

control=rpart.control(minsplit=dso$minsplit[ii], cp=0,
xval=0,maxsurrogate=0),

nbagg=dso$nbagg[ii],

model=bagging,

data=di1,

est.para=control.errorest(k=20)

)

cc <- c(ns=dso$ns[ii],minsplit=dso$minsplit[ii],nbagg=dso$nbagg[ii],error=ee$error)

print(cc)

})

las <- do.call(rbind,la)

las <- as.data.frame(las)

xyplot(error ~ ns, groups= minsplit, data=las,auto.key=TRUE,type='l')

Predictions

Based on the plot I have chosen for ns=275 and minsplit=5. But, to be honest, in a previous run I had chosen ns=150 and minsplit=2. Obviously from this plot a silly choice. But, given the high variability in this plot between parameters which are relatively similar and the totally different result, I actually think there is relatively much noise in the validation. Thus what is actually seen is that there is relatively little difference between the settings.

Having said that, these new settings got me just over 0.8 in the Kaggle score, while the previous settings were just below.

bagmod <- bagging(Survived ~.,ns=275,nbagg=500,

control=rpart.control(minsplit=5, cp=0, xval=0,maxsurrogate=0),

data=di1)

pp <- predict(bagmod,test)

out <- data.frame(

PassengerId=test$PassengerId,

Survived=pp,row.names=NULL)

write.csv(x=out,

file='bag8aug.csv',

row.names=FALSE,

quote=FALSE)

2 comments:

Alberto26 February, 2016 12:35
When you do
di1 <- subset(titanic,select=c(...

what is the object "titanic"? Should it be this?
di1 <- subset(tt[tt$status=="train",],select=c(...

Because you didn't change the NA of "train" object but of "tt" and "tt" has also the test set.
Wingfeet29 February, 2016 20:06
I think you are correct. I tried to create clean code for the blog, but something got mixed up.

Sunday, August 9, 2015

Predicting Titanic deaths on Kaggle III: Bagging

Data

Age

Selecting the survival model

Predictions

2 comments: