Sunday, June 16, 2013

Distribution of car weights

Two weeks ago I described car data, among which the weight distribution of cars in the Netherlands. At that time it was purely plots. In the meantime I decided I wanted to model trends, and as a first step I decided to fit distributions to these data. I soon found out that most of the tools seem to be organized around individual objects as units, so the 22 weight categories would have to be expanded into 100 000 records or so, one for each car. That is clearly not convenient. So I brewed my own MCMC algorithm for Bayesian estimation of the distribution without all that. As I did most of this while on vacation, with minimal access to books and the internet, it may be a bit clunky.
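To see why that route is unattractive, here is a minimal sketch (not part of the original analysis) of what expanding the binned counts into one record per car would look like, using the first few categories of the 2001 table shown under Data reading and assumed category midpoints:

# hypothetical expansion of binned counts into one row per car,
# using assumed category midpoints; with all 22 categories this
# quickly runs into hundreds of thousands of rows
counts <- c(95, 33, 262, 19950)          # first four categories (2001 data)
mids   <- c(225, 500.5, 600.5, 700.5)    # assumed category midpoints in kg
percar <- rep(mids, times = counts)      # one weight value per car
length(percar)                           # already 20340 rows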

Data reading

Nothing new here, except that I manually added a formal endpoint for the final weight category: 3500 kg. This number was chosen because it is the maximum weight you may drive with a normal driver's license here.
# counts of cars by weight category; 'Waarde' holds the counts, '-' marks missing
weight1 <- read.csv2('Motorvoertuigen__per_010613140907.csv',na.strings='-')
weight2 <- weight1[!is.na(weight1$Waarde),]
# extract build year and reference year from the Dutch labels
weight2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(weight2$Bouwjaren)))
weight2$RefYear <- as.numeric(sub(', 1 januari','',as.character(weight2$Peildatum)))
# weight category labels; translate 'en meer' to 'and more'
weightcats <- levels(weight2$Onderwerpen_2)
weightcats <- gsub('en meer','and more',weightcats)
levels(weight2$Onderwerpen_2) <- weightcats
# numeric lower and upper bounds of the categories; the open-ended
# final category gets 3500 kg as upper bound
lweightcats <- as.numeric(gsub('( |-).*$','',weightcats))
uweightcats <- as.numeric(gsub('(^[0-9]+-)|([ [:alpha:]]+$)','',weightcats))
uweightcats[uweightcats==2451] <- 3500
# keep category and count for reference year 2001; indexing by the factor
# uses its integer codes to attach the bounds
weight3 <- weight2[weight2$RefYear==2001,c(2,6)]
weight3$lower <- lweightcats[weight3$Onderwerpen_2]
weight3$upper <- uweightcats[weight3$Onderwerpen_2]
weight3
        Onderwerpen_2 Waarde lower upper
14           0-450 kg     95     0   450
196        451-550 kg     33   451   550
378        551-650 kg    262   551   650
560        651-750 kg  19950   651   750
742        751-850 kg  58752   751   850
924        851-950 kg  81650   851   950
1106      951-1050 kg  57738   951  1050
1288     1051-1150 kg  91793  1051  1150
1470     1151-1250 kg 104933  1151  1250
1652     1251-1350 kg  89506  1251  1350
1834     1351-1450 kg  30814  1351  1450
2016     1451-1550 kg  19517  1451  1550
2198     1551-1650 kg  13828  1551  1650
2380     1651-1750 kg   5512  1651  1750
2562     1751-1850 kg   4131  1751  1850
2744     1851-1950 kg   1558  1851  1950
2926     1951-2050 kg   1769  1951  2050
3108     2051-2150 kg    430  2051  2150
3290     2151-2250 kg    205  2151  2250
3472     2251-2350 kg     89  2251  2350
3654     2351-2450 kg    137  2351  2450
3836 2451 kg and more    631  2451  3500

Jumping through mixture space

I had decided the model would consist of three or four normal distributions with different weights. These weights obviously sum to 1, so it is not obvious how to jump in this space while keeping to the rule that a jump from A to B must be just as probable as a jump from B to A. It would also be nice to keep the weights between 0 and 1. In the end I decided to do this per mixture component: take one component, move it to logit space, take the jump there, move it back, and shrink or inflate all other components proportionally so the weights again sum to 1. The plot below shows a trace through mixture space.

nspacejump <- function(x,i,sd) {
  # propose new mixture weights: move component i in logit space,
  # then rescale the other components so that all weights sum to 1
  xold <- arm::logit(x[i])
  xnew <- arm::invlogit(xold + rnorm(1,0,sd))
  x <- x/sum(x[-i])*(1-xnew)
  x[i] <- xnew
  x
}

# a random walk of 30 000 proposals through the mixture space
x <- rep(1/3,3)
xmat <- matrix(0,nrow=30000,ncol=3)
index <- 1
for (irow in 1:10000) {
  for (icol in 1:3) {
    xmat[index,] <- x
    x <- nspacejump(x,icol,.1)
    index <- index + 1
  }
}

# the triangle is the simplex of possible weights; the blue line is the trace
plot(x=c(1,-1,0,1),y=c(0,0,sqrt(3),0),type='l')
lines(x=xmat[,1]-xmat[,2],y=xmat[,3]*sqrt(3),col='blue')
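As a quick sanity check (not part of the original post), one can verify that a single jump keeps all weights strictly between 0 and 1 and summing to one:

set.seed(1)                        # for a reproducible example
w <- nspacejump(c(.5, .3, .2), 2, .1)
w                                  # all components remain in (0, 1)
sum(w)                             # still sums to 1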

The model

In the end it appeared most convenient to estimate expected proportions for all weight categories and examine how well these proportions fit the observed counts. Priors: the mean weight of cars between 0 and 3500, the standard deviation of the weight of cars normal(300, 300), and the weights of the mixture components uniform over mixture space. Only the standard deviation prior is actually implemented; it serves to avoid a solution in which one normal distribution gets an enormous standard deviation, which does not seem realistic.

curldev <- function(counts,curmodel) {
  # log prior contribution of the standard deviations: normal(300, 300)
  dsd <- sum(dnorm(curmodel$sd,300,300,log=TRUE))
  # expected proportion in each weight category under each mixture component
  prop.exp <- matrix(0,nrow=nrow(counts),ncol=nrow(curmodel))
  for (i in 1:nrow(curmodel)) {
    prop.exp[,i] <- curmodel$w[i]*(pnorm(counts$upper,curmodel$mean[i],curmodel$sd[i])
                                  -pnorm(counts$lower,curmodel$mean[i],curmodel$sd[i]))
  }
  finprop <- rowSums(prop.exp)
  # log prior + multinomial log likelihood of the observed counts
  dsd + dmultinom(counts$Waarde,prob=finprop,log=TRUE)
}

Fitting

It is not a very smart algorithm. It took maybe 60 000 iterations or more before I had the feeling it had converged and was sampling from the final distribution. Luckily the data are small in this setup. To examine convergence I made plots of the deviance, means and standard deviations over the iterations; a sketch of such plots follows after the fitting code below. The third time I did the calculation I got lazy: a run of 100 000 iterations as burn-in, followed by 10 000 iterations for the results.
curmodel <- data.frame(
  mean=c(1000,1100,1200,1500),
  sd=c(200,200,200,200),
  w=c(.25,.25,.25,.25)
)

cnow <- curldev(weight3,curmodel)
ndist <- nrow(curmodel)
niter <- 10000 # you need more, say 100 000 for convergence
result <- matrix(nrow=ndist*niter,ncol=3*ndist+1)
index <- 1

for (iter in 1:niter) {
  for (dist in 1:ndist) {
    nowmodel <- curmodel
    # store the current state before proposing a new one
    result[index,] <- c(unlist(curmodel),cnow)
    # propose new mean, sd and weights for component 'dist'
    curmodel$mean[dist] <- curmodel$mean[dist] + rnorm(1,0,1)
    curmodel$sd[dist] <- curmodel$sd[dist] * runif(1,1/1.01,1.01)
    curmodel$w <- nspacejump(curmodel$w,dist,.1)
    cnew <- curldev(weight3,curmodel)
    # cat(c(cnow,cnew,'\n'))
    # accept or reject the proposal (Metropolis step)
    if (exp(cnew-cnow) > runif(1)) cnow <- cnew
    else curmodel <- nowmodel
    index <- index + 1
  }
}
curmodel
       mean       sd          w
1  864.3793  87.4235 0.31131942
2 1792.4924 650.7524 0.00655972
3 1188.4141 113.8444 0.51915833
4 1407.4449 234.6751 0.16296252
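The convergence plots themselves are not shown in this post; below is a minimal sketch of how they can be made from the result matrix filled during the run above (columns 1-4 hold the means, 5-8 the standard deviations, 9-12 the weights and column 13 the log posterior):

# trace plots used to judge burn-in and convergence
op <- par(mfrow=c(3,1), mar=c(4,4,1,1))
plot(result[,13], type='l', xlab='iteration', ylab='log posterior')
matplot(result[,1:4], type='l', lty=1, xlab='iteration', ylab='means')
matplot(result[,5:8], type='l', lty=1, xlab='iteration', ylab='standard deviations')
par(op)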

Result

First a table of the fit for the final MCMC sample; the result looks close to the observed values. I also added a plot of the distributions, thinned to 1 in 100 MCMC samples, which is why the red line looks a bit bold. The constant integral is used to keep the area under the sampled lines constant and in line with the observed values.
rescounts <- function(counts,curmodel) {
  # expected counts per weight category under the current model,
  # placed next to the observed counts
  prop.exp <- matrix(0,nrow=nrow(counts),ncol=nrow(curmodel))
  for (i in 1:nrow(curmodel)) {
    prop.exp[,i] <- curmodel$w[i]*
         (pnorm(counts$upper,curmodel$mean[i],curmodel$sd[i])
         -pnorm(counts$lower,curmodel$mean[i],curmodel$sd[i]))
  }
  finprop <- rowSums(prop.exp)
  cbind(finprop*sum(counts$Waarde),counts$Waarde)
}
rescounts(weight3,curmodel)

              [,1]   [,2]
 [1,]     65.92410     95
 [2,]     71.68709     33
 [3,]   1349.28494    262
 [4,]  16246.30388  19950
 [5,]  62352.31007  58752
 [6,]  78830.29975  81650
 [7,]  58194.45510  57738
 [8,]  86826.12802  91793
 [9,] 112381.22602 104933
[10,]  79122.57374  89506
[11,]  36005.92395  30814
[12,]  17873.04781  19517
[13,]  11812.20054  13828
[14,]   7602.84587   5512
[15,]   4219.25155   4131
[16,]   2033.40298   1558
[17,]    903.35197   1769
[18,]    423.51988    430
[19,]    247.82985    205
[20,]    183.80917     89
[21,]    152.41480    137
[22,]    579.90848    631
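As an additional rough check (not in the original post), a Pearson chi-square style statistic can summarize how close the fitted counts in the first column are to the observed counts in the second:

# sum of (observed - expected)^2 / expected over the 22 categories
fit <- rescounts(weight3, curmodel)
sum((fit[,2] - fit[,1])^2 / fit[,1])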

# total area of the observed histogram; used to scale the sampled densities
integral <- sum((weight3$upper-weight3$lower)*weight3$Waarde)
# observed frequencies per 100 kg as a step function
plot(c(Waarde,Waarde[nrow(weight3)])*100/c(upper-lower,3500-2451) ~
    c(lower,3500),data=weight3,
    type='s',xlim=c(0,3500),ylim=c(0,1.2e5),
    xlab='Weight',ylab='Frequency')
# overlay the mixture density for 1 in 100 of the stored MCMC samples
for(i in seq(1,10000,100)) {
  x <- seq(0,3500,10)
  y <- rep(0,length(x))
  newmodel <- data.frame(mean=result[i,1:4],
      sd=result[i,5:8],
      w=result[i,9:12])
  for (ir in 1:nrow(newmodel))
    y <- y + newmodel$w[ir]*dnorm(x,newmodel$mean[ir],newmodel$sd[ir])
  lines(x=x,y=y/sum(y)/10*integral,col='red')
}
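Not part of the original post, but the stored samples also allow a quick numeric summary; a minimal sketch, ignoring possible label switching between the mixture components:

# posterior means of the component means, sds and weights
# over the stored samples (columns 1-4, 5-8 and 9-12 of result)
colMeans(result[,1:12])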
