Sunday, August 9, 2015

Predicting Titanic deaths on Kaggle III: Bagging

This is the third post on prediction the deaths. The first one used randomforest, the second boosting (gbm). The aim of the third post was to use bagging. In contrast to the former posts I abandoned dplyr in this post. It gave some now you see now you don't errors.

Data

The data is supposed to be the same as previous.
library(ipred)
library(rpart)
library(lattice)

# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test  <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 35
tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100))
tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
tt$A <- factor(grepl('A',tt$Cabin))
tt$B <- factor(grepl('B',tt$Cabin))
tt$C <- factor(grepl('C',tt$Cabin))
tt$D <- factor(grepl('D',tt$Cabin))
tt$E <- factor(grepl('E',tt$Cabin))
tt$F <- factor(grepl('F',tt$Cabin))
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$PC <- factor(grepl('PC',tt$Ticket))
tt$STON <- factor(grepl('STON',tt$Ticket))
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)

Age

The first step is again to predict the missing ages. Even though we have I have all data available in one data.frame, I still think the correct approach is to create the age model using only the training data. Note that I am not too impressed with the age model. Perhaps this should also be optimized.
forage <- tt[!is.na(tt$Age) & tt$status=='train',names(tt) %in% 
   c('Age','Sex','Pclass','SibSP',
   'Parch','Fare','Title','Embarked','A','B','C','D','E','F',
   'ncabin','PC','STON','oe')]

ipbag1 <- bagging(Age ~.,data=forage)
ipbag1
Bagging regression trees with 25 bootstrap replications 

Call: bagging.data.frame(formula = Age ~ ., data = forage)
plot(tt$Age~predict(ipbag1,tt))
tt$AGE <- tt$Age
tt$AGE[is.na(tt$AGE)] <- predict(ipbag1,tt[is.na(tt$AGE),])

Selecting the survival model

ipred, the package in which bagging resides, comes with a nice general purpose cross validation utility. In the end, I decided the two parameters to be optimized are ns; the size of the bags and minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. Nbagg, the number of bootstrap evaluations, just needs to be big enough. Regarding nbagg, I have the feeling that this particular problem, with relatively few records, it may be needed to have relatively high nbagg in order to have reproducible models.
di1 <- subset(titanic,select=c(
    age,SibSp,Parch,Fare,Sex,Pclass,
        Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe,AGE,Survived))
dso <- expand.grid(ns=seq(100,300,25),nbagg=c(500),minsplit=1:6)
la <- lapply(1:nrow(dso),function(ii) {
   ee <-    errorest(Survived ~ .,
    ns=dso$ns[ii],
    control=rpart.control(minsplit=dso$minsplit[ii], cp=0, 
       xval=0,maxsurrogate=0),
    nbagg=dso$nbagg[ii],
    model=bagging,
    data=di1,
    est.para=control.errorest(k=20)
    )
    cc <- c(ns=dso$ns[ii],minsplit=dso$minsplit[ii],nbagg=dso$nbagg[ii],error=ee$error)
    print(cc)
    cc
  })
las <- do.call(rbind,la) 
las <- as.data.frame(las)
xyplot(error ~ ns, groups= minsplit, data=las,auto.key=TRUE,type='l')

Predictions

Based on the plot I have chosen for ns=275 and minsplit=5. But, to be honest, in a previous run I had chosen ns=150 and minsplit=2. Obviously from this plot a silly choice. But, given the high variability in this plot between parameters which are relatively similar and the totally different result, I actually think there is relatively much noise in the validation. Thus what is actually seen is that there is relatively little difference between the settings.
Having said that, these new settings got me just over 0.8 in the Kaggle score, while the previous settings were just below.
 bagmod <- bagging(Survived ~.,ns=275,nbagg=500,
    control=rpart.control(minsplit=5, cp=0, xval=0,maxsurrogate=0),
    data=di1)

pp <- predict(bagmod,test)

out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp,row.names=NULL)
write.csv(x=out,
    file='bag8aug.csv',
    row.names=FALSE,
    quote=FALSE)

2 comments:

  1. When you do
    di1 <- subset(titanic,select=c(...

    what is the object "titanic"? Should it be this?
    di1 <- subset(tt[tt$status=="train",],select=c(...

    Because you didn't change the NA of "train" object but of "tt" and "tt" has also the test set.

    ReplyDelete
  2. I think you are correct. I tried to create clean code for the blog, but something got mixed up.

    ReplyDelete