Sunday, June 30, 2013

Faster calculation

Last week I decided to speed up the distribution fitting functions from two weeks ago. Those were bold words. My attempt at Rcpp was a failure. Plain code optimization helped quite a bit more. Using the compiler package added a bit on top (the compiler package does not compile to machine code, it only byte-compiles, but a 10% speed increase at minimal cost is never bad).

Rcpp (with inline)

Rcpp had a number of problems. First of all, it did not work on my Windows box. It ended with an error message:
Error in compileCode(f, code, language = language, verbose = verbose) : 
  Compilation ERROR, function(s)/method(s) not created! 
In addition: Warning message:
running command 'C:/Users/Kees/Documents/R/R-3.0.1/bin/x64/R CMD SHLIB file12102de67044.cpp 2> file12102de67044.cpp.err.txt' had status 1 
As far as I understand, the .cpp file is not found. Which is not strange: it resides in a temp directory,
C:\Users\Kees\AppData\Local\Temp\Rtmp25t6tT, allocated for the current session. I gave up after playing around for half an evening and went to plan B, a Fedora box, which worked without a problem. I have no C++ experience and my C days are some 20 years behind me, so, working from examples, I was happy when I had converted a few lines of code. That part was chosen for simplicity rather than for expected speed gains; still, the whole function then took 97% of the original time. Subsequently I tried a larger part, in which a data.frame was converted. Unhappily, the whole function then took 200% of the original time: a try/tryCatch suddenly appeared which consumed more time than the original code. At this point I abandoned Rcpp and went back to more ordinary approaches. Learning Rcpp is now a bit on the back burner.

Code optimization

This is actually fairly simple. Vectorized code beats for loops, especially inner loops. Move work from inner loops to outer loops. Take the functions you call and extract only the parts you need. For example, the two functions below do the same thing, yet the first is much slower. Adding the compiler gives a little extra.
nspacejump0 <- function(x,i,sd) {
  xold <- arm::logit(x[i])
  xnew <- arm::invlogit(xold + rnorm(1,0,sd))
  x <- x/sum(x[-i])*(1-xnew)
  x[i] <- xnew
  x
}

nspacejump1 <- function(x,i,sd) {
  xold <- log(x[i]/(1-x[i])) + rnorm(1,0,sd)
  xnew <- 1/(1+exp(-xold))
  x <- x/sum(x[-i])*(1-xnew)
  x[i] <- xnew
  x
}
library(compiler)        # cmpfun() byte-compiles an R function
library(microbenchmark)
nspacejump0c <- cmpfun(nspacejump0)
nspacejump1c <- cmpfun(nspacejump1)
############### speed run ####################
microbenchmark(
    nspacejump0(c(.2,.4,.1,.3),1,.1),
    nspacejump1(c(.2,.4,.1,.3),1,.1),
    nspacejump0c(c(.2,.4,.1,.3),1,.1),
    nspacejump1c(c(.2,.4,.1,.3),1,.1))
 Unit: microseconds
                                         expr     min      lq   median      uq     max neval
   nspacejump0(c(0.2, 0.4, 0.1, 0.3), 1, 0.1) 111.428 114.360 115.8265 117.659 287.367   100
   nspacejump1(c(0.2, 0.4, 0.1, 0.3), 1, 0.1)  29.323  30.789  31.5230  32.988  79.173   100
  nspacejump0c(c(0.2, 0.4, 0.1, 0.3), 1, 0.1) 101.165 104.097 106.2960 107.763 178.138   100
  nspacejump1c(c(0.2, 0.4, 0.1, 0.3), 1, 0.1)  19.793  21.259  21.9930  22.726 108.495   100
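A quick sanity check that the two versions really are interchangeable (a sketch; both functions consume exactly one rnorm() draw, so setting the seed makes the comparison exact):
set.seed(42); a <- nspacejump0(c(.2,.4,.1,.3),1,.1)
set.seed(42); b <- nspacejump1(c(.2,.4,.1,.3),1,.1)
all.equal(a,b)   # TRUE: the hand-coded logit/invlogit matches arm's versions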

Second part code optimization

In the deviance calculation there are more possibilities for shortcuts. The original calculates expected proportions from the category limits. This can be moved to a more outer loop, since only one distribution is modified at a time. On top of that (not shown below because it was moved out of the function), the upper and lower limits in pnorm(upper,mean,sd)-pnorm(lower,mean,sd) are almost the same, so it became something like pn <- pnorm(limits,mean,sd); pn[-1]-pn[-23]. dmultinom() has been eliminated: since lgamma(x + 1) is a function of the counts x only, it is constant for a given data set and can be moved completely outside. Obviously it again helps to use the compiler.
curldev0 <- function(counts,curmodel) {
  dsd <- sum(dnorm(curmodel$sd,300,300,log=TRUE))
  prop.exp <- matrix(0,nrow=nrow(counts),ncol=nrow(curmodel))
  for (i in 1:nrow(curmodel)) {
    prop.exp[,i] <- curmodel$w[i]*(pnorm(counts$upper,curmodel$mean[i],curmodel$sd[i])-
          pnorm(counts$lower,curmodel$mean[i],curmodel$sd[i]))
  }
  finprop <- rowSums(prop.exp)
  dsd + dmultinom(counts$Waarde,prob=finprop,log=TRUE)
}
curldev5 <- function(counts,curmodel,pemodel) {
  dsd <- sum(dnorm(curmodel$sd,300,300,log=TRUE))
  prop.exp <- pemodel  %*% curmodel$w
  prop.exp <-  prop.exp/sum(prop.exp)
  dsd + lgamma(sum(counts$Waarde) + 1) + sum(counts$Waarde * log(prop.exp) - counts$lgam)
  #dmultinom(counts$Waarde,prob=prop.exp,log=TRUE)
}
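Note that curldev5() expects the constant lgamma term as a precomputed column of the data; a minimal preparation step (the column name lgam is the one used above):
# lgamma(x + 1) depends only on the observed counts, so it is computed once
weight3$lgam <- lgamma(weight3$Waarde + 1)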
##########speed run############
curldev0c <- cmpfun(curldev0)
curldev5c <- cmpfun(curldev5)
curmodel <- data.frame(
    mean=c( 892.0141263, 1.417520e+03, 1306.8248062, 1.939645e+03),
    sd  =c(  99.5660288, 2.129247e+02,  194.2452168, 1.684932e+02),
    w   =c(   0.2252041, 4.384217e-02,    0.7125014, 1.845233e-02))
pemodel <- matrix(0,nrow=22,ncol=nrow(curmodel))
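# weightlims: the vector of 23 category boundaries from the earlier weight post
# (not redefined here; presumably something like c(weight3$lower[1], weight3$upper))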
for (i in 1:nrow(curmodel)) {
  pn <- pnorm(weightlims,curmodel$mean[i],curmodel$sd[i])
  pemodel[,i] <- (pn[-1]-pn[-23])
}

microbenchmark(
curldev0(weight3,curmodel),
curldev5(weight3,curmodel,pemodel),
curldev0c(weight3,curmodel),
curldev5c(weight3,curmodel,pemodel)
)
Unit: microseconds
                                  expr     min       lq   median       uq      max neval
           curldev0(weight3, curmodel) 435.448 442.4130 447.1780 465.5050 1625.234   100
  curldev5(weight3, curmodel, pemodel)  58.646  61.5785  68.1760  71.8420  361.407   100
          curldev0c(weight3, curmodel) 394.396 402.4600 407.9585 426.6515  823.980   100
 curldev5c(weight3, curmodel, pemodel)  54.248  57.1800  63.0450  67.4430  164.943   100

Third part code optimization

This is the calling environment, so the downstream changes have their counterparts here. In the end the time used is roughly a third of the original.
version0 <- function(weight3,curmodel,niter) {
  cnow <- curldev0(weight3,curmodel)
  ndist <- nrow(curmodel)
  result <- matrix(nrow=ndist*niter,ncol=3*ndist+1)
  index <- 1
  
  for (iter in 1:niter) {
    for (dist in 1:ndist) {
      nowmodel <- curmodel
      result[index,] <- c(unlist(curmodel),cnow)
      
      curmodel$mean[dist] <- curmodel$mean[dist] + rnorm(1,0,1)
      curmodel$sd[dist] <- curmodel$sd[dist] * runif(1,1/1.01,1.01)
      curmodel$w <- nspacejump0(curmodel$w,dist,.1)
      cnew <- curldev0(weight3,curmodel)
# cat(c(cnow,cnew,'\n'))  
      if (exp(cnew-cnow) > runif(1) ) cnow <- cnew
      else curmodel <- nowmodel
      index <- index + 1
    } 
  }
  return(list(result,curmodel))
}
version6 <- function(weight3,curmodel,niter) {
  pemodel <- matrix(0,nrow=22,ncol=nrow(curmodel))
  pemodel2 <- pemodel
  for (i in 1:nrow(curmodel)) {
    pn <- pnorm(weightlims,curmodel$mean[i],curmodel$sd[i])
    pemodel[,i] <- (pn[-1]-pn[-23])
  }
  cnow <- curldev5c(weight3,curmodel,pemodel)
  ndist <- nrow(curmodel)
  result <- matrix(nrow=ndist*niter,ncol=3*ndist+1)
  index <- 1
  for (iter in 1:niter) {
    for (dist in 1:ndist) {
      nowmodel <- curmodel
      nowpe <- pemodel
      result[index,] <- c(curmodel$mean,curmodel$sd,curmodel$w,cnow)
      curmodel$mean[dist] <- curmodel$mean[dist] + rnorm(1,0,1)
      curmodel$sd[dist] <- curmodel$sd[dist] * runif(1,1/1.01,1.01)
      curmodel$w <- nspacejump1c(curmodel$w,dist,.1)
      pn <- pnorm(weightlims,curmodel$mean[dist],curmodel$sd[dist])
      pemodel[,dist] <- (pn[-1]-pn[-23])
      cnew <- curldev5c(weight3,curmodel,pemodel)
# cat(c(cnow,cnew,'\n'))  
      if (exp(cnew-cnow) > runif(1) ) cnow <- cnew
      else {curmodel <- nowmodel
        pemodel <- nowpe }
      index <- index + 1
    } 
  }
  return(list(result=result,curmodel=curmodel))
}
############### speed run ###############
version0c <- cmpfun(version0)
version6c <- cmpfun(version6)
microbenchmark(
    version0(weight3,curmodel,200),
    version6(weight3,curmodel,200),
    version0c(weight3,curmodel,200),
    version6c(weight3,curmodel,200),
    times=5)
Unit: milliseconds
                              expr      min       lq   median       uq      max neval
  version0(weight3, curmodel, 200) 648.9325 655.1388 660.7138 665.8644 669.2029     5
  version6(weight3, curmodel, 200) 264.1112 266.0451 267.8924 270.6649 274.3435     5
 version0c(weight3, curmodel, 200) 617.1030 627.7700 630.9208 638.2391 645.8507     5
 version6c(weight3, curmodel, 200) 230.3676 230.6623 231.3389 236.2792 237.2043     5
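For completeness, a minimal usage sketch of the optimized sampler, assuming weightlims and the precomputed weight3$lgam column from above are in scope:
set.seed(2013)
fit <- version6c(weight3,curmodel,1000)   # 1000 sweeps over the four components
tail(fit$result,3)   # last rows: means, sd's, weights and the stored log density
fit$curmodel         # mixture parameters at the end of the run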

Monday, June 24, 2013

Opel Corsa Diesel Usage

I wanted to extend my car weight distribution calculation of June 16 from only the year 2000 to the years 2000 to 2013. Unfortunately, come Sunday afternoon the code was too slow and I did not have even the beginning of a post. So I went on to another calculation I was working on: scraping Diesel usage data from the web and examining how it varies. The background is the recurring news items stating that actual fuel usage is much higher than reported. Since I may be in the market for an Opel Corsa Diesel, this was the car of choice.

Sourcing the data  

In contrast to my normal posts I won't be putting the data-scraping code here. The reason is purely ethical; should I make it all too easy to get these data? I am not so sure webmasters like that idea. I took the data from http://www.spritmonitor.de/. The scraping techniques I took from r-bloggers; the original was on statistical-research.com. In the end I had data whose first records and relevant variables look as shown below. The data were selected from 2010 to now, with engine sizes 55, 70 and 96 kW. Engine sizes which were one kW off were merged into the stated levels, because I had the feeling these were errors (a sketch of that merge follows after the data preview). In the end this resulted in 3713 records and 106 drivers.
usage
  Driver Month lp100 buildyear         engine
1     63    06  4.38   BJ 2011  70 kW (95 PS)
2     63    06  3.55   BJ 2011  70 kW (95 PS)
3     63    05  4.16   BJ 2011  70 kW (95 PS)
4     63    05  3.50   BJ 2011  70 kW (95 PS)
5     63    05  3.90   BJ 2011  70 kW (95 PS)
6     63    04  3.54   BJ 2011  70 kW (95 PS)
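The off-by-one merge mentioned above could look something like the sketch below; the exact kW values that were merged are an assumption, only the engine labels from the preview are used.
# sketch: extract the kW figure from labels like "70 kW (95 PS)" and snap
# values that are one kW off onto the stated levels 55, 70 and 96
kW <- as.numeric(sub(' kW.*$','',as.character(usage$engine)))
for (lev in c(55,70,96)) kW[abs(kW-lev) <= 1] <- lev
table(kW)   # should now show only the three stated engine sizes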

Model

The model makes usage, liters per 100 km (lp100), a function of engine type, month, driver and build year. I also chose to make the per-driver standard deviation part of the model. The model is estimated in JAGS; the code is at the bottom of the post. Below is the plot from jagsfit. From my point of view this looks reasonable. The trend in the drivers is there because the data were sorted when scraped.

Engines

The data shown are conditional on their own drivers. It seems the strongest engine has the highest usage, while the weakest engine has the second-worst performance. This almost suggests that the 55 kW engine is just too small.

Months

The data shown are the average usage over all drivers and cars. Summer driving is cheaper than winter driving. However, the difference is smaller than the difference between engines.

Drivers

Drivers' means are conditional on their own engines, averaged over months and build years. In a sense, these are the expected long-term usages, with shrinkage towards the population means for drivers with limited data. It is easily possible to have a driving style that costs more than 6 liters per 100 km; one extreme driver expects more than 7 liters per 100 km with the most economical car (upon examining the data notes, it may be that this driver does a lot of within-city driving: low average speed, probably lots of stop/start with the air conditioning on). On the other hand, some drivers manage clearly less than 4.5 liters per 100 km. 3.8 l/100 km is what the best driver observes as his/her personal average; in the plot it is slightly shrunk upwards.

Factory results and driver results

Based on these data one can certainly expect differences between what a factory presents and what is observed. If September weather gives 0.2 l/100 km less usage than December, the lower value will be quoted. The driving cycle used for official fuel usage has, as I read it, a fairly sedate speed profile, so it will be close to or better than the best driver. The average driver by definition has a much higher usage than the best driver. A lease driver might be among the worst performers; after all, not paying for your own fuel or repairs does not make for austere driving.

R code

library(R2jags)
data_list <- list(NDriver=nlevels(usage$Driver),
    NBY=nlevels(usage$buildyear),
    NEngine=nlevels(usage$engine),
    NMonth=nlevels(usage$Month),
    N=nrow(usage))

data_list$Driver <- (1:data_list$NDriver)[usage$Driver]
data_list$BY    <- (1:data_list$NBY)[usage$buildyear]
data_list$Engine<- (1:data_list$NEngine)[usage$engine]
data_list$Month  <- (1:data_list$NMonth)[usage$Month]
data_list$lp100  <- usage$lp100
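# uu: one row per driver with that driver's engine; used below to report driver and
# engine means conditional on the drivers who actually drive each engine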
uu <- unique(data.frame(engine=usage$engine,cDriver=usage$Driver,Driver=data_list$Driver,Engine=data_list$Engine))
data_list$uDriver <- uu$Driver
data_list$uEngine <- uu$Engine
data_list$Driver1 <- uu$Driver[uu$Engine==1]
data_list$Driver2 <- uu$Driver[uu$Engine==2]
data_list$Driver3 <- uu$Driver[uu$Engine==3]
data_list$NDriver1 <- length(data_list$Driver1)
data_list$NDriver2 <- length(data_list$Driver2)
data_list$NDriver3 <- length(data_list$Driver3)
mymod <- function() {
  for (i in 1:N) {
    fit[i] <- grandmean +
        MMonth[Month[i]] + 
        MEngine[Engine[i]] + 
        MBY[BY[i]] + 
        MDriver[Driver[i]]
    lp100[i] ~dnorm(fit[i],tauOfDriver[Driver[i]])
  }
  grandmean ~ dnorm(0,.001)
  MMonth[1] <- 0
  for (i in 2:NMonth) {
    MMonth[i] ~ dnorm(0,tauMonth) 
  }
  tauMonth ~ dgamma(0.001,0.001)
  sdMonth <- sqrt(1/tauMonth)
  MBY[1] <- 0
  for (i in 2:NBY) {
    MBY[i] ~ dnorm(0,tauBY) 
  }
  tauBY ~ dgamma(0.001,0.001)
  sdBY <- sqrt(1/tauBY)
  MEngine[1] <- 0
  for (i in 2:NEngine) {
    MEngine[i] ~ dnorm(0,tauEngine) 
  }
  tauEngine ~ dgamma(0.001,0.001)
  sdEngine <- sqrt(1/tauEngine)
  
  MDriver[1] <- 0
  for (i in 2:NDriver) {
    MDriver[i] ~ dnorm(0,tauDriver) 
  }
  tauDriver ~ dgamma(0.001,0.001)
  sdDriver <- sqrt(1/tauDriver)
  for (i in 1:NDriver) {
    tauOfDriver[i] <-  pow(sdOfDriver[i],-2)
    sdOfDriver[i] ~ dlnorm(mu.lsdD,tau.lsdD)
  }
  mu.lsdD ~ dnorm(0,.0001)
  tau.lsdD <- pow(sigma.lsdD,-2)
  sigma.lsdD ~ dunif(0,100) 
  for (i in 1:NDriver) {
    meanDriver[i] <- grandmean + MDriver[uDriver[i]] + mean(MMonth[1:NMonth]) + mean(MBY[1:NBY]) + MEngine[uEngine[i]]
  }
  for (i in 1:NMonth) {
    meanMonth[i] <- grandmean + mean(MDriver[1:NDriver]) + MMonth[i] + mean(MBY[1:NBY]) + mean(MEngine[1:NEngine])
  }
  for (i in 1:NBY) {
    meanBY[i] <- grandmean + mean(MDriver[1:NDriver]) + mean(MMonth[1:NMonth]) +MBY[i]+ mean(MEngine[1:NEngine])
  }
  for ( i in 1:NDriver1) {
    MDriver1[i] <- MDriver[Driver1[i]]
  }
  for ( i in 1:NDriver2) {
    MDriver2[i] <- MDriver[Driver2[i]]
  }
  for ( i in 1:NDriver3) {
    MDriver3[i] <- MDriver[Driver3[i]]
  }
  
  meanEngine[1] <- grandmean + mean(MDriver1) + mean(MMonth[1:NMonth])+ mean(MBY[1:NBY]) +MEngine[1]
  meanEngine[2] <- grandmean + mean(MDriver2) + mean(MMonth[1:NMonth])+ mean(MBY[1:NBY]) +MEngine[2]
  meanEngine[3] <- grandmean + mean(MDriver3) + mean(MMonth[1:NMonth])+ mean(MBY[1:NBY]) +MEngine[3]
  
}
model.file <- file.path(tempdir(),'mijnmodel.txt')
write.model(mymod,con=model.file)
inits <- function() list(
      grandmean = rnorm(1,5,1),
      MDriver = c(NA,rnorm(data_list$NDriver-1)) ,
      MBY = c(NA,rnorm(data_list$NBY-1)) ,
      MEngine = c(NA,rnorm(data_list$NEngine-1)) ,
      MMonth = c(NA,rnorm(data_list$NMonth-1)) ,
      tauDriver = runif(1,1,3),
      tauBY = runif(1,1,3),
      tauEngine=runif(1,1,3),
      tauMonth=runif(1,1,3),
      sdOfDriver = runif(data_list$NDriver,1,3)
  )

parameters <- c('sdOfDriver','meanMonth','meanDriver','meanBY', 
    'meanEngine','sdDriver','sdMonth','sdEngine')
jagsfit <- jags(data=data_list,inits=inits,model.file=model.file,
    parameters.to.save=parameters,n.chains=4,DIC=FALSE,n.iter=20000,
    progress.bar='none')
png('jagsfit.png')
plot(jagsfit)
dev.off()

jagsfit.mc <- as.mcmc(jagsfit)
fitsummary <- summary(jagsfit.mc)
stats <- fitsummary$statistics
quants <- fitsummary$quantiles

library(ggplot2)

Engines <- as.data.frame(cbind(stats[grepl('meanEngine',rownames(stats)),],
        quants[grepl('meanEngine',rownames(quants)),]))
Engines$type <- levels(usage$engine)
Engines
limits <- aes(ymin = `25%`, ymax = `75%`)
p <- ggplot(Engines, aes(y=Mean, x=type))
dodge <- position_dodge(width=0.9)
png('engines.png')
p + geom_point(position=dodge) + 
    geom_errorbar(limits, position=dodge, width=0.25) +
    coord_flip()
dev.off()

Months <- as.data.frame(cbind(stats[grepl('meanMonth',rownames(stats)),],
        quants[grepl('meanMonth',rownames(quants)),]))
Months$Month <- levels(usage$Month)
limits <- aes(ymin = `25%`, ymax = `75%`)
p <- ggplot(Months, aes(y=Mean, x=Month))
dodge <- position_dodge(width=0.9)
png('Months.png')
p + geom_point(position=dodge) + 
    geom_errorbar(limits, position=dodge, width=0.25) +
    coord_flip()
dev.off()

Driver <- as.data.frame(cbind(stats[grepl('meanDriver',rownames(stats)),],
        quants[grepl('meanDriver',rownames(quants)),]))
Driver <- cbind(uu,Driver)
limits <- aes(ymin = `25%`, ymax = `75%`)
p <- ggplot(Driver, aes(y=Mean, x=Driver,colour=engine))
dodge <- position_dodge(width=0.9)
png('driver.png')
p + geom_point(position=dodge) + 
    geom_errorbar(limits, position=dodge, width=0.25) +
    coord_flip()
dev.off()

ag <- aggregate(usage$lp100,by=list(Driver=usage$Driver),mean)
min(ag$x)

Sunday, June 16, 2013

Distribution of car weights

Two weeks ago I described car data, among which the weight distribution of cars in the Netherlands. At that time it was purely plots. In the meantime I decided I wanted to model trends, and as a first step I decided to fit distributions to these data. I soon found out that most of the tools seem to be organized around individual objects as units. So the 22 weight categories might have to be expanded to 100 000 records or so, one for each car. That is clearly not convenient. So I brewed my own MCMC algorithm for the Bayesian estimation of the distribution without all that. As I did most of this while on vacation, I had minimal access to books and the internet, so it may be a bit clunky.

Data reading

Nothing new here, except that I manually added a formal endpoint for the final weight category: 3500 kg. This number was chosen because it is the maximum weight you may drive with a normal driver's license here.
weight1 <- read.csv2('Motorvoertuigen__per_010613140907.csv',na.strings='-')
weight2 <- weight1[!is.na(weight1$Waarde),]
weight2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(weight2$Bouwjaren)))
weight2$RefYear <- as.numeric(sub(', 1 januari','',as.character(weight2$Peildatum)))
weightcats <- levels(weight2$Onderwerpen_2)
weightcats <- gsub('en meer','and more',weightcats)
levels(weight2$Onderwerpen_2) <- weightcats 
lweightcats <- as.numeric(gsub('( |-).*$','',weightcats))
uweightcats <- as.numeric(gsub('(^[0-9]+-)|([ [:alpha:]]+$)','',weightcats))
uweightcats[uweightcats==2451] <- 3500
weight3 <- weight2[weight2$RefYear==2001,c(2,6)]
weight3$lower <- lweightcats[weight3$Onderwerpen_2]
weight3$upper <- uweightcats[weight3$Onderwerpen_2]
weight3
        Onderwerpen_2 Waarde lower upper
14           0-450 kg     95     0   450
196        451-550 kg     33   451   550
378        551-650 kg    262   551   650
560        651-750 kg  19950   651   750
742        751-850 kg  58752   751   850
924        851-950 kg  81650   851   950
1106      951-1050 kg  57738   951  1050
1288     1051-1150 kg  91793  1051  1150
1470     1151-1250 kg 104933  1151  1250
1652     1251-1350 kg  89506  1251  1350
1834     1351-1450 kg  30814  1351  1450
2016     1451-1550 kg  19517  1451  1550
2198     1551-1650 kg  13828  1551  1650
2380     1651-1750 kg   5512  1651  1750
2562     1751-1850 kg   4131  1751  1850
2744     1851-1950 kg   1558  1851  1950
2926     1951-2050 kg   1769  1951  2050
3108     2051-2150 kg    430  2051  2150
3290     2151-2250 kg    205  2151  2250
3472     2251-2350 kg     89  2251  2350
3654     2351-2450 kg    137  2351  2450
3836 2451 kg and more    631  2451  3500

Jumping through mixture space

I had decided the model would consist of three or four different normal distributions with different weights. These weights obviously sum to 1, so it is not obvious how to jump in this space while keeping to the rule that a jump from A to B must be just as probable as a jump from B to A. It would also be nice to keep the weights between 0 and 1. In the end I decided to do this per mixture component: take one component, move it to logit space, take the jump, move it back, and shrink/inflate all other components proportionally to keep the sum at 1. The plot below shows a trace through mixture space.

nspacejump <- function(x,i,sd) {
  xold <- arm::logit(x[i])
  xnew <- arm::invlogit(xold + rnorm(1,0,sd))
  x <- x/sum(x[-i])*(1-xnew)
  x[i] <- xnew
  x
}

x <- rep(1/3,3)
xmat <- matrix(0,nrow=30000,ncol=3)
index <- 1
for (irow in 1:10000) {
  for (icol in 1:3) {
    xmat[index,] <- x
    x <- nspacejump(x,icol,.1)
    index <- index + 1
  }
}

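# draw the simplex as a triangle (x = x1 - x2, y = x3*sqrt(3)) and overlay the sampled trace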
plot(x=c(1,-1,0,1),y=c(0,0,sqrt(3),0),type='l')
lines(x=xmat[,1]-xmat[,2],y=xmat[,3]*sqrt(3),col='blue')

The model

In the end it proved most convenient to estimate expected proportions for all weight categories and examine how well these proportions fit. The priors were: mean weight of cars between 0 and 3500, standard deviation of the weights normal(300,300), and the mixture weights uniform over mixture space. Only the standard deviation prior is actually implemented; it serves to avoid a solution in which one normal distribution gets an enormous standard deviation, which does not seem realistic.

curldev <- function(counts,curmodel) {
  dsd <- sum(dnorm(curmodel$sd,300,300,log=TRUE))
  prop.exp <- matrix(0,nrow=nrow(counts),ncol=nrow(curmodel))
  for (i in 1:nrow(curmodel)) {
    prop.exp[,i] <- curmodel$w[i]*(pnorm(counts$upper,curmodel$mean[i],curmodel$sd[i])
          -pnorm(counts$lower,curmodel$mean[i],curmodel$sd[i]))
  }
  finprop <- rowSums(prop.exp)
  dsd + dmultinom(counts$Waarde,prob=finprop,log=TRUE)
}

Fitting

It is not a very smart algorithm. It took maybe 60 000 iterations or more before I had the feeling it had converged and was sampling from the final distribution. Luckily it is just small data in this setup. To examine convergence I made plots of the deviance, the means and the standard deviations over the iterations (a minimal sketch of such a trace plot follows after the fitting loop). The third time I did the calculation I got lazy: a 100 000 iteration run as burn-in, followed by 10 000 iterations for the results.
curmodel <- data.frame(
    mean=c(1000,1100,1200,1500),
    sd=c(200,200,200,200),
    w=c(.25,.25,.25,.25)
)

cnow <- curldev(weight3,curmodel)
ndist <- nrow(curmodel)
niter <- 10000 # you need more, say 100 000, for convergence
result <- matrix(nrow=ndist*niter,ncol=3*ndist+1)
index <- 1

for (iter in 1:niter) {
  for (dist in 1:ndist) {
    nowmodel <- curmodel
    result[index,] <- c(unlist(curmodel),cnow)
    curmodel$mean[dist] <- curmodel$mean[dist] + rnorm(1,0,1)
    curmodel$sd[dist] <- curmodel$sd[dist] * runif(1,1/1.01,1.01)
    curmodel$w <- nspacejump(curmodel$w,dist,.1)
    cnew <- curldev(weight3,curmodel)
#   cat(c(cnow,cnew,'\n'))
    if (exp(cnew-cnow) > runif(1) ) cnow <- cnew
    else curmodel <- nowmodel
    index <- index + 1
  }
}
curmodel
       mean       sd          w
1  864.3793  87.4235 0.31131942
2 1792.4924 650.7524 0.00655972
3 1188.4141 113.8444 0.51915833
4 1407.4449 234.6751 0.16296252
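The convergence plots mentioned under Fitting are not shown here; a minimal sketch for the deviance trace, using the fact that the last column of the result matrix (column 3*ndist+1 = 13) holds the stored log density:
# trace of the stored log density over the kept iterations; the means
# (columns 1:4) and sd's (columns 5:8) can be inspected the same way
plot(result[,13],type='l',xlab='iteration',ylab='log density')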

Result

First a table of the fit of the final MCMC sample. It looks like the result is close to the observed values. I also added a plot of the distributions, for which I thinned to 1 in 100 MCMC samples; hence the red line looks a bit bold. Integral is a constant used to ensure the surface under the sampled curves stays constant and in line with the observed values.
rescounts <- function(counts,curmodel) {
  prop.exp <- matrix(0,nrow=nrow(counts),ncol=nrow(curmodel))
  for (i in 1:nrow(curmodel)) {
    prop.exp[,i] <- curmodel$w[i]*
         (pnorm(counts$upper,curmodel$mean[i],curmodel$sd[i])
         -pnorm(counts$lower,curmodel$mean[i],curmodel$sd[i]))
  }
  finprop <- rowSums(prop.exp)
  cbind(finprop*sum(counts$Waarde),counts$Waarde)
}
rescounts(weight3,curmodel)

              [,1]   [,2]
 [1,]     65.92410     95
 [2,]     71.68709     33
 [3,]   1349.28494    262
 [4,]  16246.30388  19950
 [5,]  62352.31007  58752
 [6,]  78830.29975  81650
 [7,]  58194.45510  57738
 [8,]  86826.12802  91793
 [9,] 112381.22602 104933
[10,]  79122.57374  89506
[11,]  36005.92395  30814
[12,]  17873.04781  19517
[13,]  11812.20054  13828
[14,]   7602.84587   5512
[15,]   4219.25155   4131
[16,]   2033.40298   1558
[17,]    903.35197   1769
[18,]    423.51988    430
[19,]    247.82985    205
[20,]    183.80917     89
[21,]    152.41480    137
[22,]    579.90848    631

integral <- sum((weight3$upper-weight3$lower)*weight3$Waarde)
plot(c(Waarde,Waarde[nrow(weight3)])*100/c(upper-lower,3500-2451) ~  
    c(lower,3500),data=weight3,
    type='s',xlim=c(0,3500),ylim=c(0,1.2e5),
    xlab='Weight',ylab='Frequency')
for(i in seq(1,10000,100)) {
  x <- seq(0,3500,10)
  y <- rep(0,length(x))
  newmodel <- data.frame(mean = result[i,1:4],
      sd=result[i,5:8],
      w=result[i,9:12])
  for (ir in 1:nrow(newmodel))
    y <- y +  newmodel$w[ir]*dnorm(x,newmodel$mean[ir],newmodel$sd[ir])
  lines(x=x,y=y/sum(y)/10*integral,col='red')
}

Sunday, June 2, 2013

Cars in Netherlands

I am looking for a new car, so when I saw there was an update on vehicles at Statistics Netherlands I just had to go and look at the data. I learned that brown is getting more popular, that the number of cars from a certain construction year is often larger at six years of age than at five, and that lighter cars are getting more popular, especially in these later crisis years.

Color of Cars

The data I downloaded contain the number of cars by color, build year and reference date. Unfortunately it is all in Dutch, but I did translate all relevant parts. The data actually contain 14 colors, including 'other', but some of the colors were so infrequent that it all became confusing, so I merged those colors into 'other'. To plot what is sold in a certain year, I took reference date 1 January of the year after building.
The data show grey and black as the most sold; white and brown are getting more popular, blue and green less so. There is also a little dent in 2009 (crisis) and a new decrease in 2012. The news tells us 2013 is worse.

The % change in the number of cars is interesting. I was expecting that black cars might be in more accidents or something, because they are less visible. What I did see is a marked decrease in white cars, especially in the beginning of the previous decade. And there is a marked increase in cars of, say, 5 or 6 years old. Finally, at say ten years of age the cars start disappearing. The marked increase at 5 or 6 years may be explained by importing older cars. I sometimes read or hear they are bought in Germany. The Dutch tax on cars is fairly high, so it is interesting to import a cheaper second-hand car, which carries significantly less tax. For example, Athlon car lease Germany has a Dutch-language site especially for this purpose.

Fuel

There are three kinds of frequently used fuels: gasoline (Benzine), diesel and LPG (liquefied petroleum gas). Gasoline is the standard and most expensive fuel but lowest in road tax; diesel is cheaper to run but more expensive in tax; LPG even more so. It boils down to this: diesel and LPG if you drive a lot, gasoline if you drive a little. It is possible to convert from gasoline to LPG, which may explain the big increases in the number of LPG cars among older cars.

Weight

Weight is also interesting. The data come in tax categories, which I translated into continuous values by taking the lower end of the brackets. The first bracket is 0 to 450 kg, then 100 kg increments. Beats the hell out of me what cars are under 650 kg, but there is something there. Obviously most cars are between about 700 and 1800 kg. There is also some shift at 750 kg; if you have good color discrimination you might see exactly what. My eyes won't, so the next plot shows that cars of 651 to 750 kg are getting less frequent, and cars of 751 to 850 kg much more frequent. There are tax advantages to lower fuel consumption, and the economy can only strengthen that. Somewhat heavier cars suffer a decrease. Above 1500 kg the market seems to work differently.

The final plots show that the numbers of heavier cars decrease the most as they get older. Which makes sense: they are still pretty expensive to run in terms of fuel and tax. The last plot I chose because you can see the big increase in love for small cars again: it would seem old small cars don't disappear, or their numbers even increase, while the bigger cars, such as the 1151 to 1250 kg bracket, show prominent decreases from 7 years old.

R code.

library(ggplot2)      
col1 <- read.csv2('Motorvoertuigen__per_260513101348.csv',na.strings='-')
col2 <- col1[!is.na(col1$Waarde),]
col2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(col2$Bouwjaren)))
col2$RefYear <- as.numeric(sub(', 1 januari','',as.character(col2$Peildatum)))
col2$Colour <- factor( c("Beige", "Blue", "Brown", "Other", "Yellow", 
        "Grey", "Green", "Other", "Other",
        "Other", "Red", "Other", "White" ,"Black")[col2$Onderwerpen_2])
col3 <- aggregate(col2$Waarde,list(Colour=col2$Colour,BuildYear=col2$BuildYear,
        RefYear=col2$RefYear),sum)
col4 <- col3[col3$RefYear==col3$BuildYear+1,]
colourcode <-  c("#C8AD7F", "Black","Blue", "Brown", "Green" ,
    "Grey" ,  "Purple" , "Red","White" , "Yellow")
png('col1.png')
p <- ggplot(col4, aes(x=BuildYear, y=x, colour=Colour))
p + geom_line() + 
    scale_colour_manual(values=colourcode) +
    scale_y_log10("Numer of vehicles") +
    scale_x_continuous(breaks=seq(2000,2012,2))
dev.off()

##############

lastyeardata <- col3[,c('x','BuildYear','Colour','RefYear')] 
lastyeardata$RefYear <- lastyeardata$RefYear+1
colnames(lastyeardata)[colnames(lastyeardata)=='x'] <- 'LastYearAmount'
change <- merge(x=col3,y=lastyeardata)
change$Pchange <- with(change,100*(x-LastYearAmount)/LastYearAmount)
change$Age <- change$RefYear-change$BuildYear
png('col2.png')
p <- ggplot(change[change$BuildYear<2010,], aes(x=Age, y=Pchange, colour=Colour))
p + geom_line() + 
    scale_colour_manual(values=colourcode) +
    scale_y_continuous("Numer of vehicles") +
    facet_wrap(~BuildYear,nrow=2)
dev.off()

############
fuel1 <- read.csv2('Motorvoertuigen__per_010613135050.csv',na.strings='-')
fuel2 <- fuel1[!is.na(fuel1$Waarde),]
fuel2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(fuel2$Bouwjaren)))
fuel2$RefYear <- as.numeric(sub(', 1 januari','',as.character(fuel2$Peildatum)))
fuel4 <- fuel2[fuel2$RefYear==fuel2$BuildYear+1,]
png('fuel1.png')
p <- ggplot(fuel4, aes(x=BuildYear, y=Waarde, colour=Onderwerpen_2))
p + geom_line() + 
    scale_y_continuous("Numer of vehicles") +
    scale_x_continuous(breaks=seq(2000,2012,2)) +
    labs(colour="Fuel")
dev.off()
##

lastyeardata <- fuel2[,c('Waarde','BuildYear','Onderwerpen_2','RefYear')] 
lastyeardata$RefYear <- lastyeardata$RefYear+1
colnames(lastyeardata)[colnames(lastyeardata)=='Waarde'] <- 'LastYearAmount'
change <- merge(x=fuel2,y=lastyeardata)
change$Pchange <- with(change,100*(Waarde-LastYearAmount)/LastYearAmount)
change$Age <- change$RefYear-change$BuildYear
png('fuel2.png')
p <- ggplot(change[change$BuildYear<2010,], 
    aes(x=Age, y=Pchange, colour=Onderwerpen_2))
p + geom_line() + 
    scale_y_continuous("Numer of vehicles") +
    facet_wrap(~BuildYear,nrow=2) +
    labs(colour="Fuel")
dev.off()

##############

weight1 <- read.csv2('Motorvoertuigen__per_010613140907.csv',na.strings='-')
weight2 <- weight1[!is.na(weight1$Waarde),]
weight2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(weight2$Bouwjaren)))
weight2$RefYear <- as.numeric(sub(', 1 januari','',as.character(weight2$Peildatum)))
weightcats <- levels(weight2$Onderwerpen_2)
weightcats <- gsub('en meer','and more',weightcats)
levels(weight2$Onderwerpen_2) <- weightcats 
lweightcats <- as.numeric(gsub('( |-).*$','',weightcats))
weight2$lweight <- lweightcats[weight2$Onderwerpen_2]
weightcats <- weightcats[order(lweightcats)]
weight2$WeightCat <- factor(weight2$Onderwerpen_2,levels=weightcats)
weight4 <- weight2[weight2$RefYear==weight2$BuildYear+1,]
png('weight1.png')
p <- ggplot(weight4, aes(x=lweight, y=Waarde, colour=factor(BuildYear)))
p + geom_line() + 
    scale_y_continuous("Numer of vehicles") +
    labs(colour='Build Year')
dev.off()

png('weight2.png')
p <- ggplot(weight4[weight4$lweight>600& weight4$lweight<1800,], aes(x=BuildYear,y=Waarde))
p + geom_line() + 
    scale_y_continuous("Numer of vehicles") + 
    facet_wrap(~WeightCat)
dev.off()
##

lastyeardata <- weight2[,c('Waarde','BuildYear','Onderwerpen_2','RefYear')] 
lastyeardata$RefYear <- lastyeardata$RefYear+1
colnames(lastyeardata)[colnames(lastyeardata)=='Waarde'] <- 'LastYearAmount'
change <- merge(x=weight2,y=lastyeardata)
change$Pchange <- with(change,100*(Waarde-LastYearAmount)/LastYearAmount)
change$Age <- change$RefYear-change$BuildYear
png('weight3.png')
p <- ggplot(change[change$lweight>600& change$lweight<2200 & change$BuildYear<2010,]
    , aes(x=Age, y=Pchange, colour=WeightCat))
p + geom_line() + 
    scale_y_continuous("% Chane in Numer of vehicles") +
    facet_wrap(~BuildYear,nrow=2)
dev.off()

png('weight4.png')
p <- ggplot(change[change$lweight>600 & change$lweight<1200,],
    aes(x=RefYear, y=Pchange, colour=factor(Age)))
p + geom_line() + 
    scale_y_continuous("% Change in Number of vehicles") +
    facet_wrap(~WeightCat)
dev.off()