What I aim to do is enter a load of variables in the Stan model. Aliasing will be ignored, and I hope the hierarchical model will provide suitable shrinkage for terms which are not relevant.
Data
The data has been mildly adapted from previously. Biggest change is that I have decided to make age into a factor, based on the value when present, with a special level for missing. The alternative in context of a Bayesian model would be to make fitting age part of the model, but that seems more complex than I am willing to go at this point.
Model
The idea of the model is pretty simple. it will be logistic regression, with only factors in the independent variables. All levels in all factors are used. So when Sex is entered in the model, it will add two parameter values, one for male, one for female (see code below, variable fsex). When passenger class is entered in the model, it has three values (variabe fpclass). Upon the parameter values there is a prior distribution. I have chosen a normal prior distribution, around zero with standard deviation sdsex and sdpclass respectively. These have a common prior (sd1). I have used both half normal with standard deviation 3 as below and uniform (0,3) for the prior.
parameters {
real fsex[2];
real intercept;
real fpclass[3];
real <lower=0> sdsex;
real <lower=0> sdpclass;
real <lower=0> sd1;
}
transformed parameters {
real expect[ntrain];
for (i in 1:ntrain) {
expect[i] <- inv_logit(
intercept+
fsex[sex[i]]+
fpclass[pclass[i]]
);
}
}
model {
fsex ~ normal(0,sdsex);
fpclass ~ normal(0,sdpclass);
sdsex ~ normal(0,sd1);
sdpclass ~ normal(0,sd1);
sd1 ~ normal(0,3);
intercept ~ normal(0,1);
survived ~ bernoulli(expect);
}
Predictions
In this phase I have decided to make the predictions within the Stan program. The way which seemed to work is to duplicate all independent variables and do the predictions in the generated quantities section of the program. These predictions are actually probabilities of survival for each passenger for each MCMC sample. Hence a bit of post processing is used, the mean probability is calculated and a cut off of 0.5 is used to decide the final classification.
Since the Stan code runs pretty quickly after it has been compiled it is feasible to run this as a two stage process. A first run to examine the model parameters. A second run just to obtain the predictions.
Model 1
This is the parameter output of a model with just age and passenger class. It has been added here mostly to show a simple full coded example what the code looks like. With sdsex bigger than sdpclass it follows that sex had a bigger influence than class on the chance of survival.
Inference for Stan model: 8fb625a6ccf29aab919e1dcd494247aa.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
intercept -0.07 0.03 0.83 -1.68 -0.65 -0.07 0.49 1.61 670 1.00
fsex[1] 1.40 0.04 0.89 -0.46 0.83 1.40 1.98 3.07 533 1.01
fsex[2] -1.24 0.04 0.89 -3.11 -1.81 -1.23 -0.65 0.45 529 1.01
fpclass[1] 0.94 0.03 0.70 -0.40 0.50 0.93 1.33 2.34 536 1.01
fpclass[2] 0.12 0.03 0.69 -1.24 -0.31 0.11 0.52 1.55 527 1.01
fpclass[3] -0.93 0.03 0.69 -2.25 -1.35 -0.94 -0.50 0.47 499 1.01
sdsex 2.06 0.05 1.20 0.76 1.23 1.74 2.49 5.36 648 1.00
sdpclass 1.40 0.04 0.90 0.50 0.84 1.15 1.67 3.86 661 1.00
sd1 2.41 0.04 1.27 0.71 1.45 2.18 3.11 5.48 1071 1.00
lp__ -421.36 0.09 2.12 -426.25 -422.56 -421.00 -419.82 -418.19 616 1.00
Samples were drawn using NUTS(diag_e) at Sun Sep 13 12:28:49 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Model 2
Model 2 has all linear effects in it. Predictions using this model were submitted, it gave a score of 0.78947 which was not an improvement over previous scores. I have one submission using bagging which did better, got same result with boosting and worse results with random forest.
Note that while this model looks pretty complex, this score was obtained without any interactions.
Inference for Stan model: 6278b3ebade9802ad9544b1242bada20.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
intercept 0.28 0.05 0.74 -1.20 -0.16 0.28 0.77 1.68 229 1.01
sd1 0.80 0.03 0.23 0.43 0.63 0.77 0.97 1.31 63 1.05
fsex[1] 0.24 0.03 0.56 -0.64 -0.01 0.13 0.39 1.85 353 1.01
fsex[2] -0.05 0.04 0.50 -1.29 -0.24 0.00 0.22 0.89 131 1.02
fpclass[1] 0.54 0.07 0.57 -0.39 0.15 0.53 0.90 1.77 60 1.03
fpclass[2] 0.15 0.06 0.50 -0.68 -0.27 0.16 0.50 1.15 76 1.01
fpclass[3] -0.78 0.05 0.48 -1.66 -1.10 -0.79 -0.47 0.18 102 1.01
fembarked[1] 0.26 0.02 0.33 -0.29 0.06 0.22 0.43 1.02 298 1.01
fembarked[2] 0.08 0.02 0.32 -0.55 -0.07 0.06 0.20 0.78 456 1.00
fembarked[3] -0.18 0.02 0.31 -0.80 -0.33 -0.18 -0.01 0.45 378 1.01
foe[1] -0.32 0.08 0.53 -1.63 -0.65 -0.28 0.04 0.54 41 1.07
foe[2] 0.04 0.02 0.43 -0.73 -0.11 0.01 0.19 1.02 433 1.01
foe[3] 0.34 0.03 0.48 -0.41 0.04 0.26 0.57 1.54 227 1.03
fcabchar[1] -0.25 0.18 0.46 -1.16 -0.49 -0.13 0.03 0.45 7 1.17
fcabchar[2] -0.04 0.03 0.35 -0.86 -0.16 -0.04 0.10 0.62 143 1.03
fcabchar[3] 0.07 0.01 0.27 -0.47 -0.06 0.05 0.18 0.75 325 1.03
fcabchar[4] -0.09 0.04 0.31 -0.85 -0.23 -0.03 0.08 0.48 55 1.05
fcabchar[5] 0.22 0.02 0.33 -0.32 0.00 0.17 0.40 1.03 219 1.02
fcabchar[6] 0.31 0.04 0.36 -0.25 0.02 0.25 0.59 1.12 76 1.03
fcabchar[7] -0.21 0.06 0.40 -1.03 -0.43 -0.11 0.04 0.45 44 1.05
fage[1] 0.14 0.06 0.32 -0.38 -0.04 0.04 0.28 0.78 27 1.10
fage[2] 0.45 0.14 0.59 -0.12 0.02 0.17 0.77 1.68 19 1.17
fage[3] 0.10 0.15 0.53 -0.74 -0.15 -0.01 0.12 1.37 13 1.24
fage[4] -0.10 0.01 0.31 -0.86 -0.25 -0.05 0.03 0.48 558 1.01
fage[5] 0.17 0.10 0.43 -0.52 -0.04 0.04 0.25 1.07 20 1.14
fage[6] 0.00 0.01 0.21 -0.42 -0.09 -0.01 0.09 0.47 250 1.01
fage[7] 0.09 0.03 0.20 -0.32 -0.01 0.05 0.22 0.47 50 1.05
fage[8] -0.22 0.03 0.31 -1.06 -0.34 -0.15 -0.01 0.17 129 1.03
fage[9] -0.02 0.07 0.39 -0.98 -0.16 -0.01 0.12 0.60 27 1.09
fage[10] 0.00 0.01 0.19 -0.43 -0.06 0.01 0.06 0.40 813 1.00
fticket[1] 0.07 0.03 0.21 -0.33 -0.06 0.04 0.22 0.51 54 1.05
fticket[2] -0.04 0.03 0.34 -0.87 -0.17 -0.01 0.17 0.59 108 1.02
fticket[3] -0.17 0.05 0.43 -1.23 -0.38 -0.09 0.06 0.60 69 1.04
fticket[4] 0.01 0.05 0.31 -0.48 -0.19 0.01 0.19 0.69 40 1.06
fticket[5] 0.02 0.02 0.36 -0.64 -0.13 -0.02 0.15 0.95 210 1.01
fticket[6] 0.03 0.05 0.32 -0.63 -0.15 0.02 0.21 0.55 37 1.07
fticket[7] 0.14 0.02 0.38 -0.40 -0.05 0.05 0.26 1.18 245 1.01
fticket[8] 0.01 0.01 0.35 -0.84 -0.12 0.02 0.20 0.73 929 1.01
fticket[9] 0.01 0.01 0.32 -0.69 -0.12 0.03 0.14 0.65 766 1.00
fticket[10] -0.23 0.04 0.46 -1.53 -0.39 -0.08 0.03 0.38 149 1.02
fticket[11] -0.02 0.01 0.32 -0.66 -0.16 -0.04 0.11 0.70 1001 1.01
fticket[12] 0.37 0.05 0.47 -0.16 0.03 0.20 0.57 1.58 82 1.04
fticket[13] -0.20 0.02 0.34 -1.05 -0.36 -0.14 0.01 0.36 221 1.03
ftitle[1] 1.26 0.08 0.73 -0.03 0.75 1.20 1.74 2.85 79 1.03
ftitle[2] 0.47 0.04 0.70 -1.05 0.13 0.45 0.87 1.79 372 1.01
ftitle[3] -2.27 0.03 0.66 -3.59 -2.59 -2.34 -1.88 -0.86 486 1.01
ftitle[4] 0.78 0.06 0.73 -0.86 0.35 0.89 1.25 2.10 144 1.01
fsibsp[1] 0.70 0.06 0.51 -0.33 0.36 0.69 1.09 1.69 77 1.04
fsibsp[2] 0.48 0.08 0.53 -0.62 0.14 0.49 0.93 1.46 49 1.06
fsibsp[3] 0.67 0.11 0.65 -0.60 0.23 0.61 1.16 1.85 35 1.07
fsibsp[4] -1.58 0.04 0.64 -2.89 -2.03 -1.52 -1.16 -0.38 292 1.02
fparch[1] 0.25 0.02 0.33 -0.25 0.03 0.22 0.39 1.02 450 1.01
fparch[2] 0.04 0.01 0.33 -0.53 -0.13 0.00 0.17 0.83 521 1.01
fparch[3] -0.03 0.06 0.38 -0.64 -0.22 -0.01 0.14 0.78 35 1.07
fparch[4] -0.47 0.18 0.54 -1.48 -0.81 -0.31 -0.02 0.22 9 1.12
ffare[1] -0.01 0.01 0.15 -0.39 -0.04 0.00 0.03 0.29 421 1.01
ffare[2] -0.02 0.01 0.15 -0.41 -0.06 0.00 0.02 0.26 419 1.01
ffare[3] -0.02 0.01 0.15 -0.35 -0.06 0.00 0.02 0.26 671 1.00
ffare[4] 0.00 0.01 0.16 -0.31 -0.05 0.00 0.04 0.36 586 1.01
ffare[5] 0.08 0.01 0.20 -0.15 -0.01 0.01 0.12 0.66 247 1.02
sdsex 0.51 0.04 0.44 0.03 0.21 0.35 0.68 1.64 123 1.01
sdpclass 0.75 0.04 0.36 0.32 0.48 0.68 0.94 1.59 64 1.04
sdembarked 0.43 0.02 0.32 0.06 0.21 0.39 0.53 1.24 328 1.01
sdoe 0.56 0.05 0.38 0.06 0.32 0.47 0.73 1.51 48 1.05
sdcabchar 0.39 0.03 0.26 0.03 0.18 0.36 0.59 0.97 61 1.04
sdage 0.33 0.06 0.29 0.02 0.09 0.23 0.53 0.89 23 1.15
sdticket 0.35 0.03 0.24 0.04 0.19 0.30 0.45 0.99 68 1.05
sdtitle 1.29 0.02 0.35 0.75 1.07 1.19 1.47 2.15 320 1.01
sdsibsp 1.00 0.02 0.37 0.49 0.74 0.95 1.16 1.95 222 1.01
sdparch 0.47 0.12 0.37 0.01 0.16 0.38 0.73 1.25 10 1.12
sdfare 0.14 0.02 0.17 0.00 0.02 0.09 0.19 0.60 92 1.03
lp__ -342.28 2.74 15.87 -372.33 -353.23 -344.26 -330.95 -310.12 34 1.09
Samples were drawn using NUTS(diag_e) at Sun Aug 30 13:00:19 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Code
Data reading
# preparation and data reading section
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)
# generate variables
tt$Embarked[tt$Embarked==''] <- 'S'
tt$Embarked <- factor(tt$Embarked)
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 999
tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100,1000))
tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title=='Dr' & tt$Sex=='female'] <- 'Miss'
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major','Rev','Dr')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
# changed cabin character
tt$cabchar <- substr(tt$Cabin,1,1)
tt$cabchar[tt$cabchar %in% c('F','G','T')] <- 'X';
tt$cabchar <- factor(tt$cabchar)
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)
tt$ticket <- sub('[[:digit:]]+$','',tt$Ticket)
tt$ticket <- toupper(gsub('(\\.)|( )|(/)','',tt$ticket))
tt$ticket[tt$ticket %in% c('A2','A4','AQ3','AQ4','AS')] <- 'An'
tt$ticket[tt$ticket %in% c('SCA3','SCA4','SCAH','SC','SCAHBASLE','SCOW')] <- 'SC'
tt$ticket[tt$ticket %in% c('CASOTON','SOTONO2','SOTONOQ')] <- 'SOTON'
tt$ticket[tt$ticket %in% c('STONO2','STONOQ')] <- 'STON'
tt$ticket[tt$ticket %in% c('C')] <- 'CA'
tt$ticket[tt$ticket %in% c('SOC','SOP','SOPP')] <- 'SOP'
tt$ticket[tt$ticket %in% c('SWPP','WC','WEP')] <- 'W'
tt$ticket[tt$ticket %in% c('FA','FC','FCC')] <- 'F'
tt$ticket[tt$ticket %in% c('PP','PPP','LINE','LP','SP')] <- 'PPPP'
tt$ticket <- factor(tt$ticket)
tt$fare <- cut(tt$Fare,breaks=c(min(tt$Fare)-1,quantile(tt$Fare,seq(.2,.8,.2)),max(tt$Fare)+1))
train <- tt[tt$status=='train',]
test <- tt[tt$status=='test',]
#end of preparation and data reading
options(width=90)
First model
datain <- list(
survived = c(0,1)[train$Survived],
ntrain = nrow(train),
ntest=nrow(test),
sex=c(1:2)[train$Sex],
psex=c(1:2)[test$Sex],
pclass=c(1:3)[train$Pclass],
ppclass=c(1:3)[test$Pclass]
)
parameters=c('intercept','fsex','fpclass','sdsex','sdpclass','sd1')
my_code <- '
data {
int<lower=0> ntrain;
int<lower=0> ntest;
int survived[ntrain];
int <lower=1,upper=2> sex[ntrain];
int <lower=1,upper=2> psex[ntest];
int <lower=1,upper=3> pclass[ntrain];
int <lower=1,upper=3> ppclass[ntest];
}
parameters {
real fsex[2];
real intercept;
real fpclass[3];
real <lower=0> sdsex;
real <lower=0> sdpclass;
real <lower=0> sd1;
}
transformed parameters {
real expect[ntrain];
for (i in 1:ntrain) {
expect[i] <- inv_logit(
intercept+
fsex[sex[i]]+
fpclass[pclass[i]]
);
}
}
model {
fsex ~ normal(0,sdsex);
fpclass ~ normal(0,sdpclass);
sdsex ~ normal(0,sd1);
sdpclass ~ normal(0,sd1);
sd1 ~ normal(0,3);
intercept ~ normal(0,1);
survived ~ bernoulli(expect);
}
generated quantities {
real pred[ntest];
for (i in 1:ntest) {
pred[i] <- inv_logit(
intercept+
fsex[psex[i]]+
fpclass[ppclass[i]]
);
}
}
'
fit1 <- stan(model_code = my_code,
data = datain,
pars=parameters,
iter = 1000,
chains = 4,
open_progress=FALSE)
fit1
Second model
datain <- list(
survived = c(0,1)[train$Survived],
ntrain = nrow(train),
ntest=nrow(test),
sex=c(1:2)[train$Sex],
psex=c(1:2)[test$Sex],
pclass=c(1:3)[train$Pclass],
ppclass=c(1:3)[test$Pclass],
embarked=c(1:3)[train$Embarked],
pembarked=c(1:3)[test$Embarked],
oe=c(1:3)[train$oe],
poe=c(1:3)[test$oe],
cabchar=c(1:7)[train$cabchar],
pcabchar=c(1:7)[test$cabchar],
age=c(1:10)[train$age],
page=c(1:10)[test$age],
ticket=c(1:13)[train$ticket],
pticket=c(1:13)[test$ticket],
title=c(1:4)[train$Title],
ptitle=c(1:4)[test$Title],
sibsp=c(1:4,rep(4,6))[train$SibSp+1],
psibsp=c(1:4,rep(4,6))[test$SibSp+1],
parch=c(1:4,rep(4,6))[train$Parch+1],
pparch=c(1:4,rep(4,6))[test$Parch+1],
fare=c(1:5)[train$fare],
pfare=c(1:5)[test$fare]
)
parameters=c('intercept','sd1',
'fsex','fpclass','fembarked',
'foe','fcabchar','fage',
'fticket','ftitle',
'fsibsp','fparch',
'ffare',
'sdsex','sdpclass','sdembarked',
'sdoe','sdcabchar','sdage',
'sdticket','sdtitle',
'sdsibsp','sdparch',
'sdfare')
my_code <- '
data {
int<lower=0> ntrain;
int<lower=0> ntest;
int survived[ntrain];
int <lower=1,upper=2> sex[ntrain];
int <lower=1,upper=2> psex[ntest];
int <lower=1,upper=3> pclass[ntrain];
int <lower=1,upper=3> ppclass[ntest];
int <lower=1,upper=3> embarked[ntrain];
int <lower=1,upper=3> pembarked[ntest];
int <lower=1,upper=3> oe[ntrain];
int <lower=1,upper=3> poe[ntest];
int <lower=1,upper=7> cabchar[ntrain];
int <lower=1,upper=7> pcabchar[ntest];
int <lower=1,upper=10> age[ntrain];
int <lower=1,upper=10> page[ntest];
int <lower=1,upper=13> ticket[ntrain];
int <lower=1,upper=13> pticket[ntest];
int <lower=1,upper=4> title[ntrain];
int <lower=1,upper=4> ptitle[ntest];
int <lower=1,upper=4> sibsp[ntrain];
int <lower=1,upper=4> psibsp[ntest];
int <lower=1,upper=4> parch[ntrain];
int <lower=1,upper=4> pparch[ntest];
int <lower=1,upper=5> fare[ntrain];
int <lower=1,upper=5> pfare[ntest];
}
parameters {
real fsex[2];
real intercept;
real fpclass[3];
real fembarked[3];
real foe[3];
real fcabchar[7];
real fage[10];
real fticket[13];
real ftitle[4];
real fparch[4];
real fsibsp[4];
real ffare[5];
real <lower=0> sdsex;
real <lower=0> sdpclass;
real <lower=0> sdembarked;
real <lower=0> sdoe;
real <lower=0> sdcabchar;
real <lower=0> sdage;
real <lower=0> sdticket;
real <lower=0> sdtitle;
real <lower=0> sdparch;
real <lower=0> sdsibsp;
real <lower=0> sdfare;
real <lower=0> sd1;
}
transformed parameters {
real expect[ntrain];
for (i in 1:ntrain) {
expect[i] <- inv_logit(
intercept+
fsex[sex[i]]+
fpclass[pclass[i]]+
fembarked[embarked[i]]+
foe[oe[i]]+
fcabchar[cabchar[i]]+
fage[age[i]]+
fticket[ticket[i]]+
ftitle[title[i]]+
fsibsp[sibsp[i]]+
fparch[parch[i]]+
ffare[fare[i]]
);
}
}
model {
fsex ~ normal(0,sdsex);
fpclass ~ normal(0,sdpclass);
fembarked ~ normal(0,sdembarked);
foe ~ normal(0,sdoe);
fcabchar ~ normal(0,sdcabchar);
fage ~ normal(0,sdage);
fticket ~ normal(0,sdticket);
ftitle ~ normal(0,sdtitle);
fsibsp ~ normal(0,sdsibsp);
fparch ~ normal(0,sdparch);
ffare ~ normal(0,sdfare);
sdsex ~ normal(0,sd1);
sdpclass ~ normal(0,sd1);
sdembarked ~ normal(0,sd1);
sdoe ~ normal(0,sd1);
sdcabchar ~ normal(0,sd1);
sdage ~ normal(0,sd1);
sdticket ~ normal(0,sd1);
sdtitle ~ normal(0,sd1);
sdsibsp ~ normal(0,sd1);
sdparch ~ normal(0,sd1);
sdfare ~ normal(0,sd1);
sd1 ~ normal(0,1);
intercept ~ normal(0,1);
survived ~ bernoulli(expect);
}
generated quantities {
real pred[ntest];
for (i in 1:ntest) {
pred[i] <- inv_logit(
intercept+
fsex[psex[i]]+
fpclass[ppclass[i]]+
fembarked[pembarked[i]]+
foe[poe[i]]+
fcabchar[pcabchar[i]]+
fage[page[i]]+
fticket[pticket[i]]+
ftitle[ptitle[i]]+
fsibsp[psibsp[i]]+
fparch[pparch[i]]+
ffare[pfare[i]]
);
}
}
'
fit1 <- stan(model_code = my_code,
data = datain,
pars=parameters,
iter = 1000,
chains = 4,
open_progress=FALSE)
fit1
#plot(fit1,ask=TRUE)
#traceplot(fit1,ask=TRUE)
fit2 <- stan(model_code = my_code,
data = datain,
fit=fit1,
pars=c('pred'),
iter = 2000,
warmup =200,
chains = 4,
open_progress=FALSE)
fit3 <- as.matrix(fit2)[,-419]
#plots of individual passengers
#plot(density(fit3[,1]))
#plot(density(fit3[,18]))
#plot(density(as.numeric(fit3),adjust=.3))
decide1 <- apply(fit3,2,function(x) mean(x)>.5)
decide2 <- apply(fit3,2,function(x) median(x)>.5)
#table(decide1,decide2)
out <- data.frame(
PassengerId=test$PassengerId,
Survived=as.numeric(decide1),
row.names=NULL)
write.csv(x=out,
file='stanlin.csv',
row.names=FALSE,
quote=FALSE)