Sunday, June 30, 2013

Faster calculation

Last week I decided to speed up the distribution fitting functions from two weeks ago. Those were bold words. My attempt at Rcpp was a failure. Plain code optimization helped quite a bit more. Using the compiler package added a bit on top (the compiler package does not compile to machine code, it only byte-compiles, but a 10% speed increase at minimal cost is never bad).

Rcpp (with inline)

Rcpp had a number of problems. First of all, it did not work on my Windows box. It ended with an error message:
Error in compileCode(f, code, language = language, verbose = verbose) : 
  Compilation ERROR, function(s)/method(s) not created! 
In addition: Warning message:
running command 'C:/Users/Kees/Documents/R/R-3.0.1/bin/x64/R CMD SHLIB file12102de67044.cpp 2> file12102de67044.cpp.err.txt' had status 1 
As far as I understand, the .cpp file is not found. Which is not strange: it resides in a temp directory,
C:\Users\Kees\AppData\Local\Temp\Rtmp25t6tT, allocated for the current session. I gave up after playing around for half an evening and went to plan B, a Fedora box, which worked without a problem. I have no C++ experience and my C days are some 20 years behind me, so, working from examples, I was happy when I had converted a few lines of code. That part was chosen for simplicity rather than for expected speed gains; still, the whole function then took 97% of the original time. Subsequently I tried a larger part, in which a data.frame was converted. Unhappily, the whole function then took 200% of the original time: a try/tryCatch suddenly appeared which consumed more time than the original code. At this point I abandoned Rcpp and went back to more ordinary approaches. Learning Rcpp is now a bit on the back burner.

Code optimization

This is actually fairly simple. Vectorized code beats for loops, especially inner loops. Move work from inner loops to outer loops. Take the functions you call and extract only the parts you need. For example, the two functions below do the same thing, yet the first is much slower. Adding the compiler gives a little extra.
nspacejump0 <- function(x,i,sd) {
  xold <- arm::logit(x[i])
  xnew <- arm::invlogit(xold + rnorm(1,0,sd))
  x <- x/sum(x[-i])*(1-xnew)
  x[i] <- xnew
  x
}

nspacejump1 <- function(x,i,sd) {
  xold <- log(x[i]/(1-x[i])) + rnorm(1,0,sd)
  xnew <- 1/(1+exp(-xold))
  x <- x/sum(x[-i])*(1-xnew)
  x[i] <- xnew
  x
}
library(compiler)        # cmpfun() byte-compiles an R function
library(microbenchmark)
nspacejump0c <- cmpfun(nspacejump0)
nspacejump1c <- cmpfun(nspacejump1)
############### speed run ####################
microbenchmark(
    nspacejump0(c(.2,.4,.1,.3),1,.1),
    nspacejump1(c(.2,.4,.1,.3),1,.1),
    nspacejump0c(c(.2,.4,.1,.3),1,.1),
    nspacejump1c(c(.2,.4,.1,.3),1,.1))
 Unit: microseconds
                                         expr     min      lq   median      uq     max neval
   nspacejump0(c(0.2, 0.4, 0.1, 0.3), 1, 0.1) 111.428 114.360 115.8265 117.659 287.367   100
   nspacejump1(c(0.2, 0.4, 0.1, 0.3), 1, 0.1)  29.323  30.789  31.5230  32.988  79.173   100
  nspacejump0c(c(0.2, 0.4, 0.1, 0.3), 1, 0.1) 101.165 104.097 106.2960 107.763 178.138   100
  nspacejump1c(c(0.2, 0.4, 0.1, 0.3), 1, 0.1)  19.793  21.259  21.9930  22.726 108.495   100
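A quick sanity check that the two versions really are interchangeable (a sketch; both functions consume exactly one rnorm() draw, so setting the seed makes the comparison exact):
set.seed(42); a <- nspacejump0(c(.2,.4,.1,.3),1,.1)
set.seed(42); b <- nspacejump1(c(.2,.4,.1,.3),1,.1)
all.equal(a,b)   # TRUE: the hand-coded logit/invlogit matches arm's versions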

Second part code optimization

In the deviance calculation there are more possibilities for shortcuts. The original calculates expected proportions from the category limits. This can be moved to a more outer loop, since only one distribution is modified at a time. On top of that (not shown below because it was moved out of the function), the upper and lower limits in pnorm(upper,mean,sd)-pnorm(lower,mean,sd) are almost the same, so it became something like pn <- pnorm(limits,mean,sd); pn[-1]-pn[-23]. dmultinom() has been eliminated: since lgamma(x + 1) is a function of the counts x only, it is constant for a given data set and can be moved completely outside. Obviously it again helps to use the compiler.
curldev0 <- function(counts,curmodel) {
  dsd <- sum(dnorm(curmodel$sd,300,300,log=TRUE))
  prop.exp <- matrix(0,nrow=nrow(counts),ncol=nrow(curmodel))
  for (i in 1:nrow(curmodel)) {
    prop.exp[,i] <- curmodel$w[i]*(pnorm(counts$upper,curmodel$mean[i],curmodel$sd[i])-
          pnorm(counts$lower,curmodel$mean[i],curmodel$sd[i]))
  }
  finprop <- rowSums(prop.exp)
  dsd + dmultinom(counts$Waarde,prob=finprop,log=TRUE)
}
curldev5 <- function(counts,curmodel,pemodel) {
  dsd <- sum(dnorm(curmodel$sd,300,300,log=TRUE))
  prop.exp <- pemodel  %*% curmodel$w
  prop.exp <-  prop.exp/sum(prop.exp)
  dsd + lgamma(sum(counts$Waarde) + 1) + sum(counts$Waarde * log(prop.exp) - counts$lgam)
  #dmultinom(counts$Waarde,prob=prop.exp,log=TRUE)
}
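Note that curldev5() expects the constant lgamma term as a precomputed column of the data; a minimal preparation step (the column name lgam is the one used above):
# lgamma(x + 1) depends only on the observed counts, so it is computed once
weight3$lgam <- lgamma(weight3$Waarde + 1)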
##########speed run############
curldev0c <- cmpfun(curldev0)
curldev5c <- cmpfun(curldev5)
curmodel <- data.frame(
    mean=c( 892.0141263, 1.417520e+03, 1306.8248062, 1.939645e+03),
    sd  =c(  99.5660288, 2.129247e+02,  194.2452168, 1.684932e+02),
    w   =c(   0.2252041, 4.384217e-02,    0.7125014, 1.845233e-02))
pemodel <- matrix(0,nrow=22,ncol=nrow(curmodel))
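# weightlims: the vector of 23 category boundaries from the earlier weight post
# (not redefined here; presumably something like c(weight3$lower[1], weight3$upper))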
for (i in 1:nrow(curmodel)) {
  pn <- pnorm(weightlims,curmodel$mean[i],curmodel$sd[i])
  pemodel[,i] <- (pn[-1]-pn[-23])
}

microbenchmark(
curldev0(weight3,curmodel),
curldev5(weight3,curmodel,pemodel),
curldev0c(weight3,curmodel),
curldev5c(weight3,curmodel,pemodel)
)
Unit: microseconds
                                  expr     min       lq   median       uq      max neval
           curldev0(weight3, curmodel) 435.448 442.4130 447.1780 465.5050 1625.234   100
  curldev5(weight3, curmodel, pemodel)  58.646  61.5785  68.1760  71.8420  361.407   100
          curldev0c(weight3, curmodel) 394.396 402.4600 407.9585 426.6515  823.980   100
 curldev5c(weight3, curmodel, pemodel)  54.248  57.1800  63.0450  67.4430  164.943   100

Third part code optimization

This is the calling environment, so the downstream changes have their counterparts here. In the end the time used is roughly a third of the original.
version0 <- function(weight3,curmodel,niter) {
  cnow <- curldev0(weight3,curmodel)
  ndist <- nrow(curmodel)
  result <- matrix(nrow=ndist*niter,ncol=3*ndist+1)
  index <- 1
  
  for (iter in 1:niter) {
    for (dist in 1:ndist) {
      nowmodel <- curmodel
      result[index,] <- c(unlist(curmodel),cnow)
      
      curmodel$mean[dist] <- curmodel$mean[dist] + rnorm(1,0,1)
      curmodel$sd[dist] <- curmodel$sd[dist] * runif(1,1/1.01,1.01)
      curmodel$w <- nspacejump0(curmodel$w,dist,.1)
      cnew <- curldev0(weight3,curmodel)
# cat(c(cnow,cnew,'\n'))  
      if (exp(cnew-cnow) > runif(1) ) cnow <- cnew
      else curmodel <- nowmodel
      index <- index + 1
    } 
  }
  return(list(result,curmodel))
}
version6 <- function(weight3,curmodel,niter) {
  pemodel <- matrix(0,nrow=22,ncol=nrow(curmodel))
  pemodel2 <- pemodel
  for (i in 1:nrow(curmodel)) {
    pn <- pnorm(weightlims,curmodel$mean[i],curmodel$sd[i])
    pemodel[,i] <- (pn[-1]-pn[-23])
  }
  cnow <- curldev5c(weight3,curmodel,pemodel)
  ndist <- nrow(curmodel)
  result <- matrix(nrow=ndist*niter,ncol=3*ndist+1)
  index <- 1
  for (iter in 1:niter) {
    for (dist in 1:ndist) {
      nowmodel <- curmodel
      nowpe <- pemodel
      result[index,] <- c(curmodel$mean,curmodel$sd,curmodel$w,cnow)
      curmodel$mean[dist] <- curmodel$mean[dist] + rnorm(1,0,1)
      curmodel$sd[dist] <- curmodel$sd[dist] * runif(1,1/1.01,1.01)
      curmodel$w <- nspacejump1c(curmodel$w,dist,.1)
      pn <- pnorm(weightlims,curmodel$mean[dist],curmodel$sd[dist])
      pemodel[,dist] <- (pn[-1]-pn[-23])
      cnew <- curldev5c(weight3,curmodel,pemodel)
# cat(c(cnow,cnew,'\n'))  
      if (exp(cnew-cnow) > runif(1) ) cnow <- cnew
      else {curmodel <- nowmodel
        pemodel <- nowpe }
      index <- index + 1
    } 
  }
  return(list(result=result,curmodel=curmodel))
}
############### speed run ###############
version0c <- cmpfun(version0)
version6c <- cmpfun(version6)
microbenchmark(
    version0(weight3,curmodel,200),
    version6(weight3,curmodel,200),
    version0c(weight3,curmodel,200),
    version6c(weight3,curmodel,200),
    times=5)
Unit: milliseconds
                              expr      min       lq   median       uq      max neval
  version0(weight3, curmodel, 200) 648.9325 655.1388 660.7138 665.8644 669.2029     5
  version6(weight3, curmodel, 200) 264.1112 266.0451 267.8924 270.6649 274.3435     5
 version0c(weight3, curmodel, 200) 617.1030 627.7700 630.9208 638.2391 645.8507     5
 version6c(weight3, curmodel, 200) 230.3676 230.6623 231.3389 236.2792 237.2043     5
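For completeness, a minimal usage sketch of the optimized sampler, assuming weightlims and the precomputed weight3$lgam column from above are in scope:
set.seed(2013)
fit <- version6c(weight3,curmodel,1000)   # 1000 sweeps over the four components
tail(fit$result,3)   # last rows: means, sd's, weights and the stored log density
fit$curmodel         # mixture parameters at the end of the run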

Monday, June 24, 2013

Opel Corsa Diesel Usage

I wanted to extend my car weight distribution calculation of June 16 from only the year 2000 to the years 2000 to 2013. Unfortunately, come Sunday afternoon the code was too slow and I did not have even the beginning of a post. So I went on to another calculation I was working on: scraping Diesel usage data from the web and examining how it varies. The background is the recurring news items stating that actual fuel usage is much higher than reported. Since I may be in the market for an Opel Corsa Diesel, this was the car of choice.

Sourcing the data  

In contrast to my normal posts I won't be putting the data-scraping code here. The reason is purely ethical; should I make it all too easy to get these data? I am not so sure webmasters like that idea. I took the data from http://www.spritmonitor.de/. The scraping techniques I took from r-bloggers; the original was on statistical-research.com. In the end I had data whose first records and relevant variables look as shown below. The data were selected from 2010 to now, with engine sizes 55, 70 and 96 kW. Engine sizes which were one kW off were merged into the stated levels, because I had the feeling these were errors (a sketch of that merge follows after the data preview). In the end this resulted in 3713 records and 106 drivers.
usage
  Driver Month lp100 buildyear         engine
1     63    06  4.38   BJ 2011  70 kW (95 PS)
2     63    06  3.55   BJ 2011  70 kW (95 PS)
3     63    05  4.16   BJ 2011  70 kW (95 PS)
4     63    05  3.50   BJ 2011  70 kW (95 PS)
5     63    05  3.90   BJ 2011  70 kW (95 PS)
6     63    04  3.54   BJ 2011  70 kW (95 PS)
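The off-by-one merge mentioned above could look something like the sketch below; the exact kW values that were merged are an assumption, only the engine labels from the preview are used.
# sketch: extract the kW figure from labels like "70 kW (95 PS)" and snap
# values that are one kW off onto the stated levels 55, 70 and 96
kW <- as.numeric(sub(' kW.*$','',as.character(usage$engine)))
for (lev in c(55,70,96)) kW[abs(kW-lev) <= 1] <- lev
table(kW)   # should now show only the three stated engine sizes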

Model

The model makes usage, liters per 100 km (lp100), a function of engine type, month, driver and build year. I also chose to make the per-driver standard deviation part of the model. The model is estimated in JAGS; the code is at the bottom of the post. Below is the plot from jagsfit. From my point of view this looks reasonable. The trend in the drivers is there because the data were sorted when scraped.

Engines

The data shown are conditional on their own drivers. It seems the strongest engine has the highest usage, while the weakest engine has the second-worst performance. This almost suggests that the 55 kW engine is just too small.

Months

The data shown are the average usage over all drivers and cars. Summer driving is cheaper than winter driving. However, the difference is smaller than the difference between engines.

Drivers

Drivers' means are conditional on their own engines, averaged over months and build years. In a sense, these are the expected long-term usages, with shrinkage towards the population means for drivers with limited data. It is easily possible to have a driving style that costs more than 6 liters per 100 km; one extreme driver expects more than 7 liters per 100 km with the most economical car (upon examining the data notes, it may be that this driver does a lot of within-city driving: low average speed, probably lots of stop/start with the air conditioning on). On the other hand, some drivers manage clearly less than 4.5 liters per 100 km. 3.8 l/100 km is what the best driver observes as his/her personal average; in the plot it is slightly shrunk upwards.

Factory results and driver results

Based on these data one can certainly expect differences between what a factory presents and what is observed. If September weather gives 0.2 l/100 km less usage than December, the lower value will be quoted. The driving cycle used for official fuel usage has, as I read it, a fairly sedate speed profile, so it will be close to or better than the best driver. The average driver by definition has a much higher usage than the best driver. A lease driver might be among the worst performers; after all, not paying for your own fuel or repairs does not make for austere driving.

R code

library(R2jags)
data_list <- list(NDriver=nlevels(usage$Driver),
    NBY=nlevels(usage$buildyear),
    NEngine=nlevels(usage$engine),
    NMonth=nlevels(usage$Month),
    N=nrow(usage))

data_list$Driver <- (1:data_list$NDriver)[usage$Driver]
data_list$BY    <- (1:data_list$NBY)[usage$buildyear]
data_list$Engine<- (1:data_list$NEngine)[usage$engine]
data_list$Month  <- (1:data_list$NMonth)[usage$Month]
data_list$lp100  <- usage$lp100
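# uu: one row per driver with that driver's engine; used below to report driver and
# engine means conditional on the drivers who actually drive each engine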
uu <- unique(data.frame(engine=usage$engine,cDriver=usage$Driver,Driver=data_list$Driver,Engine=data_list$Engine))
data_list$uDriver <- uu$Driver
data_list$uEngine <- uu$Engine
data_list$Driver1 <- uu$Driver[uu$Engine==1]
data_list$Driver2 <- uu$Driver[uu$Engine==2]
data_list$Driver3 <- uu$Driver[uu$Engine==3]
data_list$NDriver1 <- length(data_list$Driver1)
data_list$NDriver2 <- length(data_list$Driver2)
data_list$NDriver3 <- length(data_list$Driver3)
mymod <- function() {
  for (i in 1:N) {
    fit[i] <- grandmean +
        MMonth[Month[i]] + 
        MEngine[Engine[i]] + 
        MBY[BY[i]] + 
        MDriver[Driver[i]]
    lp100[i] ~dnorm(fit[i],tauOfDriver[Driver[i]])
  }
  grandmean ~ dnorm(0,.001)
  MMonth[1] <- 0
  for (i in 2:NMonth) {
    MMonth[i] ~ dnorm(0,tauMonth) 
  }
  tauMonth ~ dgamma(0.001,0.001)
  sdMonth <- sqrt(1/tauMonth)
  MBY[1] <- 0
  for (i in 2:NBY) {
    MBY[i] ~ dnorm(0,tauBY) 
  }
  tauBY ~ dgamma(0.001,0.001)
  sdBY <- sqrt(1/tauBY)
  MEngine[1] <- 0
  for (i in 2:NEngine) {
    MEngine[i] ~ dnorm(0,tauEngine) 
  }
  tauEngine ~ dgamma(0.001,0.001)
  sdEngine <- sqrt(1/tauEngine)
  
  MDriver[1] <- 0
  for (i in 2:NDriver) {
    MDriver[i] ~ dnorm(0,tauDriver) 
  }
  tauDriver ~ dgamma(0.001,0.001)
  sdDriver <- sqrt(1/tauDriver)
  for (i in 1:NDriver) {
    tauOfDriver[i] <-  pow(sdOfDriver[i],-2)
    sdOfDriver[i] ~ dlnorm(mu.lsdD,tau.lsdD)
  }
  mu.lsdD ~ dnorm(0,.0001)
  tau.lsdD <- pow(sigma.lsdD,-2)
  sigma.lsdD ~ dunif(0,100) 
  for (i in 1:NDriver) {
    meanDriver[i] <- grandmean + MDriver[uDriver[i]] + mean(MMonth[1:NMonth]) + mean(MBY[1:NBY]) + MEngine[uEngine[i]]
  }
  for (i in 1:NMonth) {
    meanMonth[i] <- grandmean + mean(MDriver[1:NDriver]) + MMonth[i] + mean(MBY[1:NBY]) + mean(MEngine[1:NEngine])
  }
  for (i in 1:NBY) {
    meanBY[i] <- grandmean + mean(MDriver[1:NDriver]) + mean(MMonth[1:NMonth]) +MBY[i]+ mean(MEngine[1:NEngine])
  }
  for ( i in 1:NDriver1) {
    MDriver1[i] <- MDriver[Driver1[i]]
  }
  for ( i in 1:NDriver2) {
    MDriver2[i] <- MDriver[Driver2[i]]
  }
  for ( i in 1:NDriver3) {
    MDriver3[i] <- MDriver[Driver3[i]]
  }
  
  meanEngine[1] <- grandmean + mean(MDriver1) + mean(MMonth[1:NMonth])+ mean(MBY[1:NBY]) +MEngine[1]
  meanEngine[2] <- grandmean + mean(MDriver2) + mean(MMonth[1:NMonth])+ mean(MBY[1:NBY]) +MEngine[2]
  meanEngine[3] <- grandmean + mean(MDriver3) + mean(MMonth[1:NMonth])+ mean(MBY[1:NBY]) +MEngine[3]
  
}
model.file <- file.path(tempdir(),'mijnmodel.txt')
write.model(mymod,con=model.file)
inits <- function() list(
      grandmean = rnorm(1,5,1),
      MDriver = c(NA,rnorm(data_list$NDriver-1)) ,
      MBY = c(NA,rnorm(data_list$NBY-1)) ,
      MEngine = c(NA,rnorm(data_list$NEngine-1)) ,
      MMonth = c(NA,rnorm(data_list$NMonth-1)) ,
      tauDriver = runif(1,1,3),
      tauBY = runif(1,1,3),
      tauEngine=runif(1,1,3),
      tauMonth=runif(1,1,3),
      sdOfDriver = runif(data_list$NDriver,1,3)
  )

parameters <- c('sdOfDriver','meanMonth','meanDriver','meanBY', 
    'meanEngine','sdDriver','sdMonth','sdEngine')
jagsfit <- jags(data=data_list,inits=inits,model.file=model.file,
    parameters.to.save=parameters,n.chains=4,DIC=FALSE,n.iter=20000,
    progress.bar='none')
png('jagsfit.png')
plot(jagsfit)
dev.off()

jagsfit.mc <- as.mcmc(jagsfit)
fitsummary <- summary(jagsfit.mc)
stats <- fitsummary$statistics
quants <- fitsummary$quantiles

library(ggplot2)

Engines <- as.data.frame(cbind(stats[grepl('meanEngine',rownames(stats)),],
        quants[grepl('meanEngine',rownames(quants)),]))
Engines$type <- levels(usage$engine)
Engines
limits <- aes(ymin = `25%`, ymax = `75%`)
p <- ggplot(Engines, aes(y=Mean, x=type))
dodge <- position_dodge(width=0.9)
png('engines.png')
p + geom_point(position=dodge) + 
    geom_errorbar(limits, position=dodge, width=0.25) +
    coord_flip()
dev.off()

Months <- as.data.frame(cbind(stats[grepl('meanMonth',rownames(stats)),],
        quants[grepl('meanMonth',rownames(quants)),]))
Months$Month <- levels(usage$Month)
limits <- aes(ymin = `25%`, ymax = `75%`)
p <- ggplot(Months, aes(y=Mean, x=Month))
dodge <- position_dodge(width=0.9)
png('Months.png')
p + geom_point(position=dodge) + 
    geom_errorbar(limits, position=dodge, width=0.25) +
    coord_flip()
dev.off()

Driver <- as.data.frame(cbind(stats[grepl('meanDriver',rownames(stats)),],
        quants[grepl('meanDriver',rownames(quants)),]))
Driver <- cbind(uu,Driver)
limits <- aes(ymin = `25%`, ymax = `75%`)
p <- ggplot(Driver, aes(y=Mean, x=Driver,colour=engine))
dodge <- position_dodge(width=0.9)
png('driver.png')
p + geom_point(position=dodge) + 
    geom_errorbar(limits, position=dodge, width=0.25) +
    coord_flip()
dev.off()

ag <- aggregate(usage$lp100,by=list(Driver=usage$Driver),mean)
min(ag$x)

Sunday, June 16, 2013

Distribution of car weights

Two weeks ago I described car data, among which the weight distribution of cars in the Netherlands. At that time it was purely plots. In the meantime I decided I wanted to model trends, and as a first step I decided to fit distributions to these data. I soon found out that most of the tools seem to be organized around individual objects as units. So the 22 weight categories might have to be expanded to 100 000 records or so, one for each car. That is clearly not convenient. So I brewed my own MCMC algorithm for the Bayesian estimation of the distribution without all that. As I did most of this while on vacation, I had minimal access to books and the internet, so it may be a bit clunky.

Data reading

Nothing new here, except that I manually added a formal endpoint for the final weight category: 3500 kg. This number was chosen because it is the maximum weight you may drive with a normal driver's license here.
weight1 <- read.csv2('Motorvoertuigen__per_010613140907.csv',na.strings='-')
weight2 <- weight1[!is.na(weight1$Waarde),]
weight2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(weight2$Bouwjaren)))
weight2$RefYear <- as.numeric(sub(', 1 januari','',as.character(weight2$Peildatum)))
weightcats <- levels(weight2$Onderwerpen_2)
weightcats <- gsub('en meer','and more',weightcats)
levels(weight2$Onderwerpen_2) <- weightcats 
lweightcats <- as.numeric(gsub('( |-).*$','',weightcats))
uweightcats <- as.numeric(gsub('(^[0-9]+-)|([ [:alpha:]]+$)','',weightcats))
uweightcats[uweightcats==2451] <- 3500
weight3 <- weight2[weight2$RefYear==2001,c(2,6)]
weight3$lower <- lweightcats[weight3$Onderwerpen_2]
weight3$upper <- uweightcats[weight3$Onderwerpen_2]
weight3
        Onderwerpen_2 Waarde lower upper
14           0-450 kg     95     0   450
196        451-550 kg     33   451   550
378        551-650 kg    262   551   650
560        651-750 kg  19950   651   750
742        751-850 kg  58752   751   850
924        851-950 kg  81650   851   950
1106      951-1050 kg  57738   951  1050
1288     1051-1150 kg  91793  1051  1150
1470     1151-1250 kg 104933  1151  1250
1652     1251-1350 kg  89506  1251  1350
1834     1351-1450 kg  30814  1351  1450
2016     1451-1550 kg  19517  1451  1550
2198     1551-1650 kg  13828  1551  1650
2380     1651-1750 kg   5512  1651  1750
2562     1751-1850 kg   4131  1751  1850
2744     1851-1950 kg   1558  1851  1950
2926     1951-2050 kg   1769  1951  2050
3108     2051-2150 kg    430  2051  2150
3290     2151-2250 kg    205  2151  2250
3472     2251-2350 kg     89  2251  2350
3654     2351-2450 kg    137  2351  2450
3836 2451 kg and more    631  2451  3500

Jumping through mixture space

I had decided the model would consist of three or four different normal distributions with different weights. These weights obviously sum to 1, so it is not obvious how to jump in this space while keeping to the rule that a jump from A to B must be just as probable as a jump from B to A. It would also be nice to keep the weights between 0 and 1. In the end I decided to do this per mixture component: take one component, move it to logit space, take the jump, move it back, and shrink/inflate all other components proportionally to keep the sum at 1. The plot below shows a trace through mixture space.

nspacejump <- function(x,i,sd) {
  xold <- arm::logit(x[i])
  xnew <- arm::invlogit(xold + rnorm(1,0,sd))
  x <- x/sum(x[-i])*(1-xnew)
  x[i] <- xnew
  x
}

x <- rep(1/3,3)
xmat <- matrix(0,nrow=30000,ncol=3)
index <- 1
for (irow in 1:10000) {
  for (icol in 1:3) {
    xmat[index,] <- x
    x <- nspacejump(x,icol,.1)
    index <- index + 1
  }
}

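# draw the simplex as a triangle (x = x1 - x2, y = x3*sqrt(3)) and overlay the sampled trace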
plot(x=c(1,-1,0,1),y=c(0,0,sqrt(3),0),type='l')
lines(x=xmat[,1]-xmat[,2],y=xmat[,3]*sqrt(3),col='blue')

The model

In the end it proved most convenient to estimate expected proportions for all weight categories and examine how well these proportions fit. The priors were: mean weight of cars between 0 and 3500, standard deviation of the weights normal(300,300), and the mixture weights uniform over mixture space. Only the standard deviation prior is actually implemented; it serves to avoid a solution in which one normal distribution gets an enormous standard deviation, which does not seem realistic.

curldev <- function(counts,curmodel) {
  dsd <- sum(dnorm(curmodel$sd,300,300,log=TRUE))
  prop.exp <- matrix(0,nrow=nrow(counts),ncol=nrow(curmodel))
  for (i in 1:nrow(curmodel)) {
    prop.exp[,i] <- curmodel$w[i]*(pnorm(counts$upper,curmodel$mean[i],curmodel$sd[i])
          -pnorm(counts$lower,curmodel$mean[i],curmodel$sd[i]))
  }
  finprop <- rowSums(prop.exp)
  dsd + dmultinom(counts$Waarde,prob=finprop,log=TRUE)
}

Fitting

It is not a very smart algorithm. It took maybe 60 000 iterations or more before I had the feeling it had converged and was sampling from the final distribution. Luckily it is just small data in this setup. To examine convergence I made plots of the deviance, the means and the standard deviations over the iterations (a minimal sketch of such a trace plot follows after the fitting loop). The third time I did the calculation I got lazy: a 100 000 iteration run as burn-in, followed by 10 000 iterations for the results.
curmodel <- data.frame(
    mean=c(1000,1100,1200,1500),
    sd=c(200,200,200,200),
    w=c(.25,.25,.25,.25)
)

cnow <- curldev(weight3,curmodel)
ndist <- nrow(curmodel)
niter <- 10000 # you need more, say 100 000, for convergence
result <- matrix(nrow=ndist*niter,ncol=3*ndist+1)
index <- 1

for (iter in 1:niter) {
  for (dist in 1:ndist) {
    nowmodel <- curmodel
    result[index,] <- c(unlist(curmodel),cnow)
    curmodel$mean[dist] <- curmodel$mean[dist] + rnorm(1,0,1)
    curmodel$sd[dist] <- curmodel$sd[dist] * runif(1,1/1.01,1.01)
    curmodel$w <- nspacejump(curmodel$w,dist,.1)
    cnew <- curldev(weight3,curmodel)
#   cat(c(cnow,cnew,'\n'))
    if (exp(cnew-cnow) > runif(1) ) cnow <- cnew
    else curmodel <- nowmodel
    index <- index + 1
  }
}
curmodel
       mean       sd          w
1  864.3793  87.4235 0.31131942
2 1792.4924 650.7524 0.00655972
3 1188.4141 113.8444 0.51915833
4 1407.4449 234.6751 0.16296252
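The convergence plots mentioned under Fitting are not shown here; a minimal sketch for the deviance trace, using the fact that the last column of the result matrix (column 3*ndist+1 = 13) holds the stored log density:
# trace of the stored log density over the kept iterations; the means
# (columns 1:4) and sd's (columns 5:8) can be inspected the same way
plot(result[,13],type='l',xlab='iteration',ylab='log density')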

Result

First a table of the fit of the final MCMC sample. It looks like the result is close to the observed values. I also added a plot of the distributions, for which I thinned to 1 in 100 MCMC samples; hence the red line looks a bit bold. Integral is a constant used to ensure the surface under the sampled curves stays constant and in line with the observed values.
rescounts <- function(counts,curmodel) {
  prop.exp <- matrix(0,nrow=nrow(counts),ncol=nrow(curmodel))
  for (i in 1:nrow(curmodel)) {
    prop.exp[,i] <- curmodel$w[i]*
         (pnorm(counts$upper,curmodel$mean[i],curmodel$sd[i])
         -pnorm(counts$lower,curmodel$mean[i],curmodel$sd[i]))
  }
  finprop <- rowSums(prop.exp)
  cbind(finprop*sum(counts$Waarde),counts$Waarde)
}
rescounts(weight3,curmodel)

              [,1]   [,2]
 [1,]     65.92410     95
 [2,]     71.68709     33
 [3,]   1349.28494    262
 [4,]  16246.30388  19950
 [5,]  62352.31007  58752
 [6,]  78830.29975  81650
 [7,]  58194.45510  57738
 [8,]  86826.12802  91793
 [9,] 112381.22602 104933
[10,]  79122.57374  89506
[11,]  36005.92395  30814
[12,]  17873.04781  19517
[13,]  11812.20054  13828
[14,]   7602.84587   5512
[15,]   4219.25155   4131
[16,]   2033.40298   1558
[17,]    903.35197   1769
[18,]    423.51988    430
[19,]    247.82985    205
[20,]    183.80917     89
[21,]    152.41480    137
[22,]    579.90848    631

integral <- sum((weight3$upper-weight3$lower)*weight3$Waarde)
plot(c(Waarde,Waarde[nrow(weight3)])*100/c(upper-lower,3500-2451) ~  
    c(lower,3500),data=weight3,
    type='s',xlim=c(0,3500),ylim=c(0,1.2e5),
    xlab='Weight',ylab='Frequency')
for(i in seq(1,10000,100)) {
  x <- seq(0,3500,10)
  y <- rep(0,length(x))
  newmodel <- data.frame(mean = result[i,1:4],
      sd=result[i,5:8],
      w=result[i,9:12])
  for (ir in 1:nrow(newmodel))
    y <- y +  newmodel$w[ir]*dnorm(x,newmodel$mean[ir],newmodel$sd[ir])
  lines(x=x,y=y/sum(y)/10*integral,col='red')
}

Sunday, June 2, 2013

Cars in Netherlands

I am looking for a new car, so when I saw there was an update on vehicles at Statistics Netherlands I just had to go and look at the data. I learned that brown is getting more popular, that the number of cars from a certain construction year is often larger at six years of age than at five, and that lighter cars are getting more popular, especially in these later crisis years.

Color of Cars

The data I downloaded contain the number of cars by color, build year and reference date. Unfortunately it is all in Dutch, but I did translate all relevant parts. The data actually contain 14 colors, including 'other', but some of the colors were so infrequent that it all became confusing, so I merged those colors into 'other'. To plot what is sold in a certain year, I took reference date 1 January of the year after building.
The data show grey and black as the most sold; white and brown are getting more popular, blue and green less so. There is also a little dent in 2009 (crisis) and a new decrease in 2012. The news tells us 2013 is worse.

The % change in the number of cars is interesting. I was expecting that black cars might be in more accidents or something, because they are less visible. What I did see is a marked decrease in white cars, especially in the beginning of the previous decade. And there is a marked increase in cars of, say, 5 or 6 years old. Finally, at say ten years of age the cars start disappearing. The marked increase at 5 or 6 years may be explained by importing older cars. I sometimes read or hear they are bought in Germany. The Dutch tax on cars is fairly high, so it is interesting to import a cheaper second-hand car, which carries significantly less tax. For example, Athlon car lease Germany has a Dutch-language site especially for this purpose.

Fuel

There are three kinds of frequently used fuels: gasoline (Benzine), diesel and LPG (liquefied petroleum gas). Gasoline is the standard and most expensive fuel but lowest in road tax; diesel is cheaper to run but more expensive in tax; LPG even more so. It boils down to this: diesel and LPG if you drive a lot, gasoline if you drive a little. It is possible to convert from gasoline to LPG, which may explain the big increases in the number of LPG cars among older cars.

Weight

Weight is also interesting. The data come in tax categories, which I translated into continuous values by taking the lower end of the brackets. The first bracket is 0 to 450 kg, then 100 kg increments. Beats the hell out of me what cars are under 650 kg, but there is something there. Obviously most cars are between about 700 and 1800 kg. There is also some shift at 750 kg; if you have good color discrimination you might see exactly what. My eyes won't, so the next plot shows that cars of 651 to 750 kg are getting less frequent, and cars of 751 to 850 kg much more frequent. There are tax advantages to lower fuel consumption, and the economy can only strengthen that. Somewhat heavier cars suffer a decrease. Above 1500 kg the market seems to work differently.

The final plots show that the numbers of heavier cars decrease the most as they get older. Which makes sense: they are still pretty expensive to run in terms of fuel and tax. The last plot I chose because you can see the big increase in love for small cars again: it would seem old small cars don't disappear, or their numbers even increase, while the bigger cars, such as the 1151 to 1250 kg bracket, show prominent decreases from 7 years old.

R code.

library(ggplot2)      
col1 <- read.csv2('Motorvoertuigen__per_260513101348.csv',na.strings='-')
col2 <- col1[!is.na(col1$Waarde),]
col2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(col2$Bouwjaren)))
col2$RefYear <- as.numeric(sub(', 1 januari','',as.character(col2$Peildatum)))
col2$Colour <- factor( c("Beige", "Blue", "Brown", "Other", "Yellow", 
        "Grey", "Green", "Other", "Other",
        "Other", "Red", "Other", "White" ,"Black")[col2$Onderwerpen_2])
col3 <- aggregate(col2$Waarde,list(Colour=col2$Colour,BuildYear=col2$BuildYear,
        RefYear=col2$RefYear),sum)
col4 <- col3[col3$RefYear==col3$BuildYear+1,]
colourcode <-  c("#C8AD7F", "Black","Blue", "Brown", "Green" ,
    "Grey" ,  "Purple" , "Red","White" , "Yellow")
png('col1.png')
p <- ggplot(col4, aes(x=BuildYear, y=x, colour=Colour))
p + geom_line() + 
    scale_colour_manual(values=colourcode) +
    scale_y_log10("Numer of vehicles") +
    scale_x_continuous(breaks=seq(2000,2012,2))
dev.off()

##############

lastyeardata <- col3[,c('x','BuildYear','Colour','RefYear')] 
lastyeardata$RefYear <- lastyeardata$RefYear+1
colnames(lastyeardata)[colnames(lastyeardata)=='x'] <- 'LastYearAmount'
change <- merge(x=col3,y=lastyeardata)
change$Pchange <- with(change,100*(x-LastYearAmount)/LastYearAmount)
change$Age <- change$RefYear-change$BuildYear
png('col2.png')
p <- ggplot(change[change$BuildYear<2010,], aes(x=Age, y=Pchange, colour=Colour))
p + geom_line() + 
    scale_colour_manual(values=colourcode) +
    scale_y_continuous("Numer of vehicles") +
    facet_wrap(~BuildYear,nrow=2)
dev.off()

############
fuel1 <- read.csv2('Motorvoertuigen__per_010613135050.csv',na.strings='-')
fuel2 <- fuel1[!is.na(fuel1$Waarde),]
fuel2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(fuel2$Bouwjaren)))
fuel2$RefYear <- as.numeric(sub(', 1 januari','',as.character(fuel2$Peildatum)))
fuel4 <- fuel2[fuel2$RefYear==fuel2$BuildYear+1,]
png('fuel1.png')
p <- ggplot(fuel4, aes(x=BuildYear, y=Waarde, colour=Onderwerpen_2))
p + geom_line() + 
    scale_y_continuous("Numer of vehicles") +
    scale_x_continuous(breaks=seq(2000,2012,2)) +
    labs(colour="Fuel")
dev.off()
##

lastyeardata <- fuel2[,c('Waarde','BuildYear','Onderwerpen_2','RefYear')] 
lastyeardata$RefYear <- lastyeardata$RefYear+1
colnames(lastyeardata)[colnames(lastyeardata)=='Waarde'] <- 'LastYearAmount'
change <- merge(x=fuel2,y=lastyeardata)
change$Pchange <- with(change,100*(Waarde-LastYearAmount)/LastYearAmount)
change$Age <- change$RefYear-change$BuildYear
png('fuel2.png')
p <- ggplot(change[change$BuildYear<2010,], 
    aes(x=Age, y=Pchange, colour=Onderwerpen_2))
p + geom_line() + 
    scale_y_continuous("Numer of vehicles") +
    facet_wrap(~BuildYear,nrow=2) +
    labs(colour="Fuel")
dev.off()

##############

weight1 <- read.csv2('Motorvoertuigen__per_010613140907.csv',na.strings='-')
weight2 <- weight1[!is.na(weight1$Waarde),]
weight2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(weight2$Bouwjaren)))
weight2$RefYear <- as.numeric(sub(', 1 januari','',as.character(weight2$Peildatum)))
weightcats <- levels(weight2$Onderwerpen_2)
weightcats <- gsub('en meer','and more',weightcats)
levels(weight2$Onderwerpen_2) <- weightcats 
lweightcats <- as.numeric(gsub('( |-).*$','',weightcats))
weight2$lweight <- lweightcats[weight2$Onderwerpen_2]
weightcats <- weightcats[order(lweightcats)]
weight2$WeightCat <- factor(weight2$Onderwerpen_2,levels=weightcats)
weight4 <- weight2[weight2$RefYear==weight2$BuildYear+1,]
png('weight1.png')
p <- ggplot(weight4, aes(x=lweight, y=Waarde, colour=factor(BuildYear)))
p + geom_line() + 
    scale_y_continuous("Numer of vehicles") +
    labs(colour='Build Year')
dev.off()

png('weight2.png')
p <- ggplot(weight4[weight4$lweight>600& weight4$lweight<1800,], aes(x=BuildYear,y=Waarde))
p + geom_line() + 
    scale_y_continuous("Numer of vehicles") + 
    facet_wrap(~WeightCat)
dev.off()
##

lastyeardata <- weight2[,c('Waarde','BuildYear','Onderwerpen_2','RefYear')] 
lastyeardata$RefYear <- lastyeardata$RefYear+1
colnames(lastyeardata)[colnames(lastyeardata)=='Waarde'] <- 'LastYearAmount'
change <- merge(x=weight2,y=lastyeardata)
change$Pchange <- with(change,100*(Waarde-LastYearAmount)/LastYearAmount)
change$Age <- change$RefYear-change$BuildYear
png('weight3.png')
p <- ggplot(change[change$lweight>600& change$lweight<2200 & change$BuildYear<2010,]
    , aes(x=Age, y=Pchange, colour=WeightCat))
p + geom_line() + 
    scale_y_continuous("% Chane in Numer of vehicles") +
    facet_wrap(~BuildYear,nrow=2)
dev.off()

png('weight4.png')
p <- ggplot(change[change$lweight>600 & change$lweight<1200,],
    aes(x=RefYear, y=Pchange, colour=factor(Age)))
p + geom_line() + 
    scale_y_continuous("% Change in Number of vehicles") +
    facet_wrap(~WeightCat)
dev.off()