Sourcing the data
In contrast to my normal posts I won't be putting code for data scraping here. The reason is purely ethical; should I make it all too easy to get these data? I am not so sure webmasters like that idea. I took the data from http://www.spritmonitor.de/. The scraping techniques I took from r-bloggers, original was on statistical-research.com. At the end I had data where first records and relevant variables looked like shown below. The data was selected from 2010 to now with engine sizes 55, 70 and 96 kW. Engine sizes which were one kW off, were merged to the stated levels, because I had the feeling these were errors. In the end this resulted in 3713 records and 106 drivers.
usage
Driver Month lp100 buildyear engine
1 63 06 4.38 BJ 2011 70 kW (95 PS)
2 63 06 3.55 BJ 2011 70 kW (95 PS)
3 63 05 4.16 BJ 2011 70 kW (95 PS)
4 63 05 3.50 BJ 2011 70 kW (95 PS)
5 63 05 3.90 BJ 2011 70 kW (95 PS)
6 63 04 3.54 BJ 2011 70 kW (95 PS)
Model
The model used makes usage, liters per 100 km (lp100), a function of engine type, Month, driver and buildyear. It was also chosen to make the driver's standard deviation part of the model. The model is estimated in JAGS. The code is at the bottom of the post. Below the plot from jagsfit. From my point of view this looks reasonable. The trend at drivers is because the data was sorted when scraped.
Engines
The data shown are conditional on their own drivers. It seems the strongest engine has the highest usage, while the weakest engine has second worst performance. This almost suggests that the 55 kW engine is just too small.
Months
The data shown is the average usage over all drivers and cars. Summer is cheaper to drive than winter. However, the difference is smaller than between engines.
Drivers
Drivers' means are conditional on their own engines, averaged over months and build years. In a sense, these are the expected long term usages, with shrinkage to the population means for drivers with limited data. It is easy possible to have a style for more than 6 liter per 100 km, one extreme driver expects more than 7 liter per 100 km, with the most economic car (upon examining the data notes, it may be this driver has loads of within city driving, average low speed, maybe loads of start/stop with air conditioning on). On the other hand, some drivers manage clearly less than 4.5 liter per 100 km. 3.8 l/100 km is what the best driver observes as his/her personal average, in the plot it is slightly shrunk upwards.
Factory results and driver results
Based on these data one can certainly expect differences between what a factory presents as data and what is observed. If September weather gives 0.2 l/100 km less usage than December, the lower value will be given. The driving cycle for fuel usage has, as I read it, a fairly sedate speed. So, close or better than the best driver. The average driver by definition has a much higher usage than the best driver. A lease driver might be among the worst performers. After all, not paying your own fuel or repairs does not make for austere driving.
R code
library(R2jags)data_list <- list(NDriver=nlevels(usage$Driver),
NBY=nlevels(usage$buildyear),
NEngine=nlevels(usage$engine),
NMonth=nlevels(usage$Month),
N=nrow(usage))
data_list$Driver <- (1:data_list$NDriver)[usage$Driver]
data_list$BY <- (1:data_list$NBY)[usage$buildyear]
data_list$Engine<- (1:data_list$NEngine)[usage$engine]
data_list$Month <- (1:data_list$NMonth)[usage$Month]
data_list$lp100 <- usage$lp100
uu <- unique(data.frame(engine=usage$engine,cDriver=usage$Driver,Driver=data_list$Driver,Engine=data_list$Engine))
data_list$uDriver <- uu$Driver
data_list$uEngine <- uu$Engine
data_list$Driver1 <- uu$Driver[uu$Engine==1]
data_list$Driver2 <- uu$Driver[uu$Engine==2]
data_list$Driver3 <- uu$Driver[uu$Engine==3]
data_list$NDriver1 <- length(data_list$Driver1)
data_list$NDriver2 <- length(data_list$Driver2)
data_list$NDriver3 <- length(data_list$Driver3)
mymod <- function() {
for (i in 1:N) {
fit[i] <- grandmean +
MMonth[Month[i]] +
MEngine[Engine[i]] +
MBY[BY[i]] +
MDriver[Driver[i]]
lp100[i] ~dnorm(fit[i],tauOfDriver[Driver[i]])
}
grandmean ~ dnorm(0,.001)
MMonth[1] <- 0
for (i in 2:NMonth) {
MMonth[i] ~ dnorm(0,tauMonth)
}
tauMonth ~ dgamma(0.001,0.001)
sdMonth <- sqrt(1/tauMonth)
MBY[1] <- 0
for (i in 2:NBY) {
MBY[i] ~ dnorm(0,tauBY)
}
tauBY ~ dgamma(0.001,0.001)
sdBY <- sqrt(1/tauBY)
MEngine[1] <- 0
for (i in 2:NEngine) {
MEngine[i] ~ dnorm(0,tauEngine)
}
tauEngine ~ dgamma(0.001,0.001)
sdEngine <- sqrt(1/tauEngine)
MDriver[1] <- 0
for (i in 2:NDriver) {
MDriver[i] ~ dnorm(0,tauDriver)
}
tauDriver ~ dgamma(0.001,0.001)
sdDriver <- sqrt(1/tauDriver)
for (i in 1:NDriver) {
tauOfDriver[i] <- pow(sdOfDriver[i],-2)
sdOfDriver[i] ~ dlnorm(mu.lsdD,tau.lsdD)
}
mu.lsdD ~ dnorm(0,.0001)
tau.lsdD <- pow(sigma.lsdD,-2)
sigma.lsdD ~ dunif(0,100)
for (i in 1:NDriver) {
meanDriver[i] <- grandmean + MDriver[uDriver[i]] + mean(MMonth[1:NMonth]) + mean(MBY[1:NBY]) + MEngine[uEngine[i]]
}
for (i in 1:NMonth) {
meanMonth[i] <- grandmean + mean(MDriver[1:NDriver]) + MMonth[i] + mean(MBY[1:NBY]) + mean(MEngine[1:NEngine])
}
for (i in 1:NBY) {
meanBY[i] <- grandmean + mean(MDriver[1:NDriver]) + mean(MMonth[1:NMonth]) +MBY[i]+ mean(MEngine[1:NEngine])
}
for ( i in 1:NDriver1) {
MDriver1[i] <- MDriver[Driver1[i]]
}
for ( i in 1:NDriver2) {
MDriver2[i] <- MDriver[Driver2[i]]
}
for ( i in 1:NDriver3) {
MDriver3[i] <- MDriver[Driver3[i]]
}
meanEngine[1] <- grandmean + mean(MDriver1) + mean(MMonth[1:NMonth])+ mean(MBY[1:NBY]) +MEngine[1]
meanEngine[2] <- grandmean + mean(MDriver2) + mean(MMonth[1:NMonth])+ mean(MBY[1:NBY]) +MEngine[2]
meanEngine[3] <- grandmean + mean(MDriver3) + mean(MMonth[1:NMonth])+ mean(MBY[1:NBY]) +MEngine[3]
}
model.file <- file.path(tempdir(),'mijnmodel.txt')
write.model(mymod,con=model.file)
inits <- function() list(
grandmean = rnorm(1,5,1),
MDriver = c(NA,rnorm(data_list$NDriver-1)) ,
MBY = c(NA,rnorm(data_list$NBY-1)) ,
MEngine = c(NA,rnorm(data_list$NEngine-1)) ,
MMonth = c(NA,rnorm(data_list$NMonth-1)) ,
tauDriver = runif(1,1,3),
tauBY = runif(1,1,3),
tauEngine=runif(1,1,3),
tauMonth=runif(1,1,3),
sdOfDriver = runif(data_list$NDriver,1,3)
)
parameters <- c('sdOfDriver','meanMonth','meanDriver','meanBY',
'meanEngine','sdDriver','sdMonth','sdEngine')
jagsfit <- jags(data=data_list,inits=inits,model.file=model.file,
parameters.to.save=parameters,n.chains=4,DIC=FALSE,n.iter=20000,
progress.bar='none')
png('jagsfit.png')
plot(jagsfit)
dev.off()
jagsfit.mc <- as.mcmc(jagsfit)
fitsummary <- summary(jagsfit.mc)
stats <- fitsummary$statistics
quants <- fitsummary$quantiles
library(ggplot2)
Engines <- as.data.frame(cbind(stats[grepl('meanEngine',rownames(stats)),],
quants[grepl('meanEngine',rownames(quants)),]))
Engines$type <- levels(usage$engine)
Engines
limits <- aes(ymax = `25%`, ymin=`75%`)
p <- ggplot(Engines, aes(y=Mean, x=type))
dodge <- position_dodge(width=0.9)
png('engines.png')
p + geom_point(position=dodge) +
geom_errorbar(limits, position=dodge, width=0.25) +
coord_flip()
dev.off()
Months <- as.data.frame(cbind(stats[grepl('meanMonth',rownames(stats)),],
quants[grepl('meanMonth',rownames(quants)),]))
Months$Month <- levels(usage$Month)
limits <- aes(ymax = `25%`, ymin=`75%`)
p <- ggplot(Months, aes(y=Mean, x=Month))
dodge <- position_dodge(width=0.9)
png('Months.png')
p + geom_point(position=dodge) +
geom_errorbar(limits, position=dodge, width=0.25) +
coord_flip()
dev.off()
Driver <- as.data.frame(cbind(stats[grepl('meanDriver',rownames(stats)),],
quants[grepl('meanDriver',rownames(quants)),]))
Driver <- cbind(uu,Driver)
limits <- aes(ymax = `25%`, ymin=`75%`)
p <- ggplot(Driver, aes(y=Mean, x=Driver,colour=engine))
dodge <- position_dodge(width=0.9)
png('driver.png')
p + geom_point(position=dodge) +
geom_errorbar(limits, position=dodge, width=0.25) +
coord_flip()
dev.off()
ag <- aggregate(usage$lp100,by=list(Driver=usage$Driver),mean)
min(ag$x)
No comments:
Post a Comment