Data
The data is supposed to be the same as in the previous post.
library(ipred)
library(rpart)
library(lattice)
# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)
# generate variables
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
# binned age; missing values are set to 35, so they fall in the adult (21,55] bin
tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 35
tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100))
# extract the title from the name and pool the rare titles
tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
# indicator variables for the cabin deck letters
tt$A <- factor(grepl('A',tt$Cabin))
tt$B <- factor(grepl('B',tt$Cabin))
tt$C <- factor(grepl('C',tt$Cabin))
tt$D <- factor(grepl('D',tt$Cabin))
tt$E <- factor(grepl('E',tt$Cabin))
tt$F <- factor(grepl('F',tt$Cabin))
tt$ncabin <- nchar(as.character(tt$Cabin))
# ticket prefix indicators
tt$PC <- factor(grepl('PC',tt$Ticket))
tt$STON <- factor(grepl('STON',tt$Ticket))
# cabin number and its parity (-1 when no cabin number is known)
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)
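A quick sanity check, for instance a table of Title against status and a summary of some of the other derived variables, could confirm that the recoding did what was intended:
# quick check of the derived variables
table(tt$Title,tt$status)
summary(tt[,c('age','ncabin','oe','Fare')])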
Age
The first step is again to predict the missing ages. Even though I have all data available in one data.frame, I still think the correct approach is to create the age model using only the training data. Note that I am not too impressed with the age model; perhaps this should also be optimized.
forage <- tt[!is.na(tt$Age) & tt$status=='train',names(tt) %in%
    c('Age','Sex','Pclass','SibSp',
      'Parch','Fare','Title','Embarked','A','B','C','D','E','F',
      'ncabin','PC','STON','oe')]
ipbag1 <- bagging(Age ~.,data=forage)
ipbag1
Bagging regression trees with 25 bootstrap replications
Call: bagging.data.frame(formula = Age ~ ., data = forage)
plot(tt$Age~predict(ipbag1,tt))
tt$AGE <- tt$Age
tt$AGE[is.na(tt$AGE)] <- predict(ipbag1,tt[is.na(tt$AGE),])
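To put a number on how good (or bad) the age model really is, one could for instance let ipred's errorest (more on that below) cross-validate it; for a numeric response it reports a root mean squared error, which would make comparing alternative age models straightforward. A minimal sketch, assuming 10-fold cross-validation is enough:
# sketch: 10-fold cross-validated root mean squared error of the age model
set.seed(1)
errorest(Age ~ .,data=forage,model=bagging,
    est.para=control.errorest(k=10))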
Selecting the survival model
ipred, the package in which bagging resides, comes with a nice general-purpose cross-validation utility. In the end, I decided the two parameters to be optimized are ns, the size of the bags, and minsplit, the minimum number of observations that must exist in a node in order for a split to be attempted. nbagg, the number of bootstrap replications, just needs to be big enough. Regarding nbagg, I have the feeling that for this particular problem, with relatively few records, a relatively high nbagg may be needed to get reproducible models.
di1 <- subset(tt[tt$status=='train',],select=c(
    age,SibSp,Parch,Fare,Sex,Pclass,
    Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe,AGE,Survived))
dso <- expand.grid(ns=seq(100,300,25),nbagg=c(500),minsplit=1:6)
la <- lapply(1:nrow(dso),function(ii) {
  ee <- errorest(Survived ~ .,
      ns=dso$ns[ii],
      control=rpart.control(minsplit=dso$minsplit[ii], cp=0,
          xval=0,maxsurrogate=0),
      nbagg=dso$nbagg[ii],
      model=bagging,
      data=di1,
      est.para=control.errorest(k=20)
  )
  cc <- c(ns=dso$ns[ii],minsplit=dso$minsplit[ii],nbagg=dso$nbagg[ii],error=ee$error)
  print(cc)
  cc
})
las <- do.call(rbind,la)
las <- as.data.frame(las)
xyplot(error ~ ns, groups= minsplit, data=las,auto.key=TRUE,type='l')
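Regarding the remark above that nbagg needs to be big enough for reproducible models, a quick check could be to fit the same bagging model twice with different seeds and compare the out-of-bag error estimates; a sketch, assuming the out-of-bag error is a reasonable proxy:
# sketch: two fits with different seeds; with a large nbagg the two
# out-of-bag error estimates should be close
set.seed(1)
m1 <- bagging(Survived ~ .,data=di1,nbagg=500,coob=TRUE)
set.seed(2)
m2 <- bagging(Survived ~ .,data=di1,nbagg=500,coob=TRUE)
c(m1$err,m2$err)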
Predictions
Based on the plot I have chosen ns=275 and minsplit=5. To be honest, in a previous run I had chosen ns=150 and minsplit=2, which from this plot looks like a silly choice. But given the high variability in the plot between parameter settings that are quite similar, and the totally different result between runs, I think there is quite a lot of noise in the validation. So what the plot really shows is that there is relatively little difference between the settings.
Having said that, these new settings got me just over 0.8 in the Kaggle score, while the previous settings were just below.
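A rough way to quantify that noise could be to repeat errorest a handful of times at the chosen settings and look at the spread of the estimates; slow with nbagg=500 and twenty folds, but a sketch would be:
# sketch: repeat the 20-fold cross-validation at ns=275, minsplit=5
# a few times to see how much the error estimate itself varies
sapply(1:5,function(i) {
  set.seed(i)
  errorest(Survived ~ .,data=di1,model=bagging,
      ns=275,nbagg=500,
      control=rpart.control(minsplit=5, cp=0, xval=0, maxsurrogate=0),
      est.para=control.errorest(k=20))$error
})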
bagmod <- bagging(Survived ~.,ns=275,nbagg=500,
control=rpart.control(minsplit=5, cp=0, xval=0,maxsurrogate=0),
data=di1)
# predict on the test rows of tt, which contain the derived variables
pp <- predict(bagmod,tt[tt$status=='test',])
out <- data.frame(
PassengerId=test$PassengerId,
Survived=pp,row.names=NULL)
write.csv(x=out,
file='bag8aug.csv',
row.names=FALSE,
quote=FALSE)
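A quick look at the submission object, for instance with str and head, could confirm it has just the PassengerId and Survived columns before uploading:
# quick look at the submission before uploading
str(out)
head(out)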