I was reading a statistics book the other day; I am still at the beginning. In one section I read '(we) report results as odds ratios, which is more intuitive'. I must have read sentences stating this any number of times, but I don't agree.
It may be my background. My mother tongue is Dutch, which does not have a word for odds, so odds is a word I only learned when I got into statistics. It is an abstract mathematical construct. The odds ratio, then, is a ratio between two abstract numbers, and initially it was just a number to me. Since I never did much with odds and odds ratios, they never gained meaning either.
Now I am just wondering: what proportion of the people reading this kind of book does not have an intuition for odds?
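For readers in the same position, a small numeric illustration (my own, not from the book) of what an odds ratio is:
# odds are the probability of an event divided by the probability of no event
p1 <- 0.8                  # say, germination probability in group 1
p2 <- 0.6                  # germination probability in group 2
odds1 <- p1/(1-p1)         # 4
odds2 <- p2/(1-p2)         # 1.5
odds1/odds2                # odds ratio: about 2.67, while the risk ratio p1/p2 is only 1.33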
Sunday, January 25, 2015
Sunday, January 18, 2015
SAS PROC MCMC example in R: Logistic Regression Random-Effects Model
In this post I will run the SAS example Logistic Regression Random-Effects Model in four R based solutions: JAGS, STAN, MCMCpack and LaplacesDemon. To quote the SAS manual: 'The data are taken from Crowder (1978). The Seeds data set is a 2 x 2 factorial layout, with two types of seeds, O. aegyptiaca 75 and O. aegyptiaca 73, and two root extracts, bean and cucumber. You observe r, which is the number of germinated seeds, and n, which is the total number of seeds. The independent variables are seed and extract.'
The point of this exercise is to demonstrate a random effect: each observation has a random effect associated with it. In contrast, the other parameters have non-informative priors. As such, the models are not complex.
Previous posts in the series of PROC MCMC examples programmed in R were example 1: sampling, example 2: Box-Cox transformation, example 5: Poisson regression and example 6: Non-Linear Poisson Regression.
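Written out, the model that all four tools below fit is (my notation, matching the code):
\[
r_i \sim \mathrm{Binomial}(n_i, p_i), \qquad
\operatorname{logit}(p_i) = \alpha_1 + \alpha_2 x_{1i} + \alpha_3 x_{2i} + \alpha_4 x_{1i} x_{2i} + \delta_i, \qquad
\delta_i \sim N(0, \sigma^2), \quad i = 1, \dots, 21,
\]
with vague normal priors on the alphas and, depending on the program, a gamma hyperprior on tau = 1/sigma^2 or an inverse gamma hyperprior on sigma^2 (s2 in the code).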
JAGS
To make things easy on myself I wondered whether these data were already present as R data. That is when I discovered a post at Grimwisdom doing exactly what I wanted as a JAGS program; that code was the basis for this JAGS code.
One thing on the models and distributions: JAGS uses tau (1/variance) as the hyperparameter for the delta parameters, and tau gets a gamma prior. In contrast, all other programs parameterize the delta distribution via the variance s2 (its square root entering as the standard deviation), and hence s2 gets the inverse gamma prior. This distribution is available in the MCMCpack package and also in STAN. Gelman has a publication on this kind of priors: Prior distributions for variance parameters in hierarchical models.
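As a quick check on that equivalence (my own snippet, not part of the SAS example): if tau ~ Gamma(shape a, rate b), then s2 = 1/tau has an inverse gamma density with shape a and scale b, which is what dinvgamma() in MCMCpack computes.
library(MCMCpack)
a <- 0.01; b <- 0.01                      # the hyperprior values used throughout this post
s2 <- 0.1                                 # an arbitrary test value
dinvgamma(s2, shape=a, scale=b)           # inverse gamma density at s2
dgamma(1/s2, shape=a, rate=b) / s2^2      # gamma density of tau = 1/s2 times the Jacobian 1/s2^2
# both lines print the same value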
library(R2jags)
data(orob2,package='aod')
seedData <- list ( N = 21,
r = orob2$y,
n = orob2$n,
x1 = c(1,0)[orob2$seed],
x2 = c(0,1)[orob2$root]
)
modelRandomEffect <- function() {
for(i in 1:4) {alpha[i] ~ dnorm(0.0,1.0E-6)}
for(i in 1:N) {delta[i] ~ dnorm(0.0,tau)}
for (i in 1:N) {
logit(p[i]) <- alpha[1] +
alpha[2]*x1[i] +
alpha[3]*x2[i] +
alpha[4]*x1[i]*x2[i] +
delta[i];
r[i] ~ dbin(p[i],n[i]);
}
tau ~ dgamma(.01,0.01)
s2 <- 1/tau
}
params <- c('alpha','s2')
myj <-jags(model=modelRandomEffect,
data = seedData,
parameters=params)
myj
Inference for Bugs model at "/tmp/RtmpKURJBm/model70173774d0c.txt", fit using jags,
3 chains, each with 2000 iterations (first 1000 discarded)
n.sims = 3000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
alpha[1] -0.538 0.191 -0.907 -0.664 -0.542 -0.419 -0.135 1.005 440
alpha[2] 0.078 0.300 -0.550 -0.111 0.083 0.269 0.646 1.005 440
alpha[3] 1.342 0.284 0.768 1.164 1.340 1.524 1.895 1.004 1400
alpha[4] -0.807 0.441 -1.663 -1.098 -0.817 -0.521 0.046 1.009 300
s2 0.105 0.096 0.010 0.041 0.077 0.138 0.366 1.051 47
deviance 101.236 6.653 89.853 96.595 100.903 105.313 114.371 1.022 95
For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
DIC info (using the rule, pD = var(deviance)/2)
pD = 21.7 and DIC = 122.9
DIC is an estimate of expected predictive error (lower deviance is better).
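Not part of the original output, but one way to get at the underlying draws from the R2jags fit, for instance to compare s2 across the four tools:
s2.draws <- myj$BUGSoutput$sims.matrix[,'s2']   # assumes the fit object myj from above
quantile(s2.draws, c(.025,.5,.975))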
MCMCpack
MCMCpack had some difficulty with this particular prior for s2. In the end I chose an inverse Gamma(0.1, 0.1) prior, which worked. Therefore the parameter estimates turn out slightly different. For conciseness only the first parameters are displayed in the output.
library(MCMCpack)
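# Not in the original post: a quick visual comparison of the intended inverse
# Gamma(0.01, 0.01) prior with the inverse Gamma(0.1, 0.1) actually used, over
# the range where the posterior of s2 roughly lives in the other fits.
curve(dinvgamma(x, 0.1, 0.1), from=0.005, to=0.5, lty=2, ylab='prior density')
curve(dinvgamma(x, 0.01, 0.01), from=0.005, to=0.5, add=TRUE)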
xmat <- cbind(rep(1,21),seedData$x1,seedData$x2,seedData$x1*seedData$x2)
MCMCfun <- function(parm) {
beta=parm[1:4]
s2=parm[5]
delta=parm[5+(1:21)]
if(s2<0 ) return(-Inf)
step1 <- xmat %*% beta + delta
p <- LaplacesDemon::invlogit(step1)
LL <- sum(dbinom(seedData$r,seedData$n,p,log=TRUE))
prior <- sum(dnorm(beta,0,1e3,log=TRUE))+
sum(dnorm(delta,0,sqrt(s2),log=TRUE))+
log(dinvgamma(s2,.1,.1))
LP=LL+prior
return(LP)
}
inits <- c(rnorm(4,0,1),runif(1,0,1),rnorm(21,0,1))
names(inits) <- c(paste('beta',0:3,sep=''),
's2',
paste('delta',1:21,sep=''))
mcmcout <- MCMCmetrop1R(MCMCfun,
inits)
summary(mcmcout)
Iterations = 501:20500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 20000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
[1,] -0.59046 0.30456 0.0021535 0.03613
[2,] 0.01476 0.47526 0.0033606 0.04866
[3,] 1.48129 0.38227 0.0027031 0.03883
[4,] -0.93754 0.66414 0.0046962 0.07025
[5,] 0.35146 0.08004 0.0005659 0.05020
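The rows in this summary are unnamed because MCMCmetrop1R only sees a parameter vector. A small addition of my own (not in the original code) carries the names along via coda:
library(coda)
varnames(mcmcout) <- names(inits)   # label the columns of the mcmc object
summary(mcmcout[,1:5])              # now reported as beta0..beta3 and s2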
STAN
Stan had no problems at all with this model.
library(rstan)
smodel <- '
data {
int <lower=1> N;
int n[N];
int r[N];
real x1[N];
real x2[N];
}
parameters {
real Beta[4];
real <lower=0> s2;
real Delta[N];
}
transformed parameters {
vector[N] mu;
for (i in 1:N) {
mu[i] <- inv_logit(
Beta[1] + Beta[2]*x1[i] +
Beta[3]*x2[i]+Beta[4]*x1[i]*x2[i]+
Delta[i]);
}
}
model {
r ~ binomial(n,mu);
s2 ~ inv_gamma(.01,.01);
Delta ~ normal(0,sqrt(s2));
Beta ~ normal(0,1000);
}
'
fstan <- stan(model_code = smodel,
data = seedData,
pars=c('Beta','s2'))
fstan
Inference for Stan model: smodel.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
Beta[1] -0.56 0.01 0.20 -0.97 -0.68 -0.55 -0.43 -0.14 1370 1.00
Beta[2] 0.08 0.01 0.33 -0.58 -0.12 0.08 0.29 0.72 1385 1.00
Beta[3] 1.36 0.01 0.29 0.83 1.18 1.36 1.54 1.97 1355 1.00
Beta[4] -0.83 0.01 0.46 -1.76 -1.12 -0.82 -0.53 0.04 1434 1.00
s2 0.12 0.00 0.11 0.02 0.06 0.10 0.16 0.42 588 1.01
lp__ -523.70 0.34 6.86 -537.26 -528.19 -523.73 -519.21 -509.87 403 1.01
Samples were drawn using NUTS(diag_e) at Sat Jan 17 18:27:38 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
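For completeness, a way (not shown in the original post) to pull the posterior draws of s2 out of the Stan fit, analogous to the JAGS extraction above:
s2.stan <- extract(fstan, pars='s2')$s2
quantile(s2.stan, c(.025,.5,.975))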
LaplacesDemon
While in the MCMCpack code I borrowed the invlogit() function from LaplacesDemon, in the LaplacesDemon code I borrow the inverse gamma density from MCMCpack; I guess it evens out. For a change, no further tweaking of the code was needed. Note that the core likelihood function is the same as for MCMCpack; however, LaplacesDemon is able to use the correct prior. For brevity, again part of the output has not been copied into the blog.
library('LaplacesDemon')
mon.names <- "LP"
parm.names <- c(paste('beta',0:3,sep=''),
's2',
paste('delta',1:21,sep=''))
PGF <- function(Data) {
x <-c(rnorm(21+4+1,0,1))
x[5] <- runif(1,0,2)
x
}
MyData <- list(mon.names=mon.names,
parm.names=parm.names,
PGF=PGF,
xmat = xmat,
r=seedData$r,
n=seedData$n)
N<-1
Model <- function(parm, Data)
{
beta=parm[1:4]
s2=parm[5]
delta=parm[5+(1:21)]
# if(s2<0 ) return(-Inf)
step1 <- xmat %*% beta + delta
p <- invlogit(step1)
LL <- sum(dbinom(seedData$r,seedData$n,p,log=TRUE))
tau <- 1/s2
prior <- sum(dnorm(beta,0,1e3,log=TRUE))+
sum(dnorm(delta,0,sqrt(s2),log=TRUE))+
log(dinvgamma(s2,.01,.01))
LP=LL+prior
Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LP,
yhat=p,
parm=parm)
return(Modelout)
}
Initial.Values <- GIV(Model, MyData, PGF=TRUE)
Fit1 <- LaplacesDemon(Model,
Data=MyData,
Initial.Values = Initial.Values
)
Fit1
Call:
LaplacesDemon(Model = Model, Data = MyData, Initial.Values = Initial.Values)
Acceptance Rate: 0.67594
Algorithm: Metropolis-within-Gibbs
Covariance Matrix: (NOT SHOWN HERE; diagonal shown instead)
beta0 beta1 beta2 beta3 s2 delta1 delta2 delta3
0.218082 0.218082 0.218082 0.218082 0.218082 0.218082 0.218082 0.218082
delta4 delta5 delta6 delta7 delta8 delta9 delta10 delta11
0.218082 0.218082 0.218082 0.218082 0.218082 0.218082 0.218082 0.218082
delta12 delta13 delta14 delta15 delta16 delta17 delta18 delta19
0.218082 0.218082 0.218082 0.218082 0.218082 0.218082 0.218082 0.218082
delta20 delta21
0.218082 0.218082
Covariance (Diagonal) History: (NOT SHOWN HERE)
Deviance Information Criterion (DIC):
All Stationary
Dbar 100.063 100.063
pD 26.630 26.630
DIC 126.692 126.692
Initial Values:
[1] 1.385256609 -0.634946833 1.456635236 -0.041162276 1.883504417
[6] -1.380783003 -0.688367493 0.210060822 0.127231904 0.710367572
[11] -0.865780359 -1.649760777 -0.005532662 -0.114739142 0.642440639
[16] -0.919494616 -0.829018195 -0.938486769 0.302152995 -1.877933490
[21] 1.170542660 0.131282852 0.210852443 0.808779058 -2.115209547
[26] 0.431205368
Iterations: 10000
Log(Marginal Likelihood): -27.92817
Minutes of run-time: 0.81
Model: (NOT SHOWN HERE)
Monitor: (NOT SHOWN HERE)
Parameters (Number of): 26
Posterior1: (NOT SHOWN HERE)
Posterior2: (NOT SHOWN HERE)
Recommended Burn-In of Thinned Samples: 0
Recommended Burn-In of Un-thinned Samples: 0
Recommended Thinning: 240
Specs: (NOT SHOWN HERE)
Status is displayed every 100 iterations
Summary1: (SHOWN BELOW)
Summary2: (SHOWN BELOW)
Thinned Samples: 1000
Thinning: 10
Summary of All Samples
Mean SD MCSE ESS LB
beta0 -0.544001376 0.2305040 0.03535316 93.03421 -0.95266664
beta1 0.060270601 0.3458235 0.04387534 106.92244 -0.67629607
beta2 1.360903746 0.3086684 0.04838620 60.40225 0.76275725
beta3 -0.828532715 0.4864563 0.06323965 99.93646 -1.74744708
s2 0.134846805 0.1055559 0.01599649 68.64912 0.02549966
(...)
Summary of Stationary Samples
Mean SD MCSE ESS LB
beta0 -0.544001376 0.2305040 0.03535316 93.03421 -0.95266664
beta1 0.060270601 0.3458235 0.04387534 106.92244 -0.67629607
beta2 1.360903746 0.3086684 0.04838620 60.40225 0.76275725
beta3 -0.828532715 0.4864563 0.06323965 99.93646 -1.74744708
s2 0.134846805 0.1055559 0.01599649 68.64912 0.02549966
Sunday, January 11, 2015
Multivariate analysis of death rate on the map of Europe
Eurostat has information on death rates by cause and NUTS 2 region. I am trying to get this displayed visually on a map. To get there I map all causes to three dimensions via a principal components analysis. These three dimensions are subsequently translated into RGB colors and placed on the map of Europe.
To explain the colors I did something similar as for the map itself; to avoid crowding that figure, only forty of the causes are displayed.
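In a nutshell the coloring step works like this (a condensed sketch of my own; the full script is in the Code section below):
# rescale each of the first three principal component scores to [0,1]
myscale <- function(x) (x-min(x))/(max(x)-min(x))
pr1 <- princomp(m2, cor=TRUE)               # m2: regions in rows, causes of death in columns
cols <- rgb(myscale(pr1$scores[,1]),
            myscale(pr1$scores[,2]),
            myscale(pr1$scores[,3]))        # one color per region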
library("RColorBrewer")
library("GISTools")
library("classInt")
library(dplyr)
temp <- tempfile(fileext = ".zip")
# now download the zip file from its location on the Eurostat website and
# put it into the temp object
# new Eurostat website
# old: http://epp.eurostat.ec.europa.eu
# new: http://ec.europa.eu/eurostat
download.file("http://ec.europa.eu/eurostat/cache/GISCO/geodatafiles/NUTS_2010_60M_SH.zip",
temp)
# now unzip the boundary data
unzip(temp)
EU_NUTS <- readOGR(dsn = "./NUTS_2010_60M_SH/data", layer = "NUTS_RG_60M_2010")
ToRemove <- EU_NUTS@data$STAT_LEVL!=2 | grepl('FR9',EU_NUTS@data$NUTS_ID)
EUN <- EU_NUTS[!ToRemove,]
###
dir()
r1 <- read.csv('hlth_cd_acdr2_1_Data.csv')
r1$Value <- as.character(r1$Value) %>%
gsub(',','',.) %>%
gsub(':','0',.) %>%
as.numeric(.)
r2 <- reshape(r1,direction='wide',
idvar=c('ICD10','ICD10_LABEL'),
v.names='Value',
timevar='GEO',
drop=c('UNIT','TIME','AGE','SEX','Flag.and.Footnotes'))
names(r2) <- gsub('Value.','',names(r2),fixed=TRUE)
r2 <- r2[!(r2$ICD10 %in%
c('A-R_V-Y','A-B','C','E','F','G-H','I','J','K','M','N','R','V01-Y89','ACC'))
,]
row.names(r2) <- r2$ICD10_LABEL
m1 <- as.matrix(r2[,-(1:2)]) %>% t(.)
m2 <- m1[rownames(m1) %in% EUN@data$NUTS_ID,]
pr1 <- princomp(m2,cor=TRUE)
#plot(pr1)
#biplot(pr1)
#plot(pr1$loadings[,1],pr1$loadings[,2])
myscale <-function(x)
(x-min(x))/(max(x)-min(x))
tom <- data.frame(
loc=as.character(rownames(pr1$scores)),
rgbr=myscale(pr1$scores[,1]),
rgbg=myscale(pr1$scores[,2]),
rgbb=myscale(pr1$scores[,3]))
tom$rgb <- with(tom,rgb(rgbr,rgbg,rgbb))
EUN@data = data.frame(EUN@data[,1:4], tom[
match(EUN@data[, "NUTS_ID"],tom[, "loc"]), ])
png('map.png')
par(mar=rep(0,4))
plot <- plot(EUN, col = EUN@data$rgb,
axes = FALSE, border = NA)
dev.off()
load <- as.data.frame(as.matrix(pr1$loadings[,1:3]))
load$name <- rownames(pr1$loadings)
load <- mutate(load,
rgbr=myscale(Comp.1),
rgbg=myscale(Comp.2),
rgbb=myscale(Comp.3),
rgb =rgb(rgbr,rgbg,rgbb))
range(load$Comp.1)
png('legend2.png',
width=960,
height=960,
res=144)
par(mar=rep(0,4))
with(load,plot(x=Comp.1,y=Comp.2,type='n',
xlim=range(Comp.1)*1.2))
sample_n(load,40) %>%
with(.,text(x=Comp.1,y=Comp.2,
labels=name,
col=rgb,cex=.5))
dev.off()
Data
Death rates are from the Eurostat table causes of death. I took the crude death rate. On that website, starting from the default setup, I removed gender and added causes. Then I transposed causes to rows and regions to columns, so I would not be bothered by translation of invalid column names. I exported those as .xls, but got only part of the regions. My second attempt was an export as .csv. In general I prefer to take the .xls from Eurostat, as they separate thousands via a ','; however, an incomplete export was not acceptable, so the .csv is used.
For the mapping, the script used is based on rpubs: Mapping Data from Eurostat using R. However, some changes were needed as Eurostat reorganized their website. Another adaptation is that I removed all of France's overseas parts; these gave too much whitespace for my taste.
Result
The resulting map is nice to see; there is certainly structure. Regions close to each other have similar colors, hence similar death rates. However, interpreting the colors themselves is difficult. Most important of all, these are comparisons between regions.
Code
library("rgdal")library("RColorBrewer")
library("GISTools")
library("classInt")
library(dplyr)
temp <- tempfile(fileext = ".zip")
# now download the zip file from its location on the Eurostat website and
# put it into the temp object
# new Eurostat website
# old: http://epp.eurostat.ec.europa.eu
# new: http://ec.europa.eu/eurostat
download.file("http://ec.europa.eu/eurostat/cache/GISCO/geodatafiles/NUTS_2010_60M_SH.zip",
temp)
# now unzip the boundary data
unzip(temp)
EU_NUTS <- readOGR(dsn = "./NUTS_2010_60M_SH/data", layer = "NUTS_RG_60M_2010")
ToRemove <- EU_NUTS@data$STAT_LEVL!=2 | grepl('FR9',EU_NUTS@data$NUTS_ID)
EUN <- EU_NUTS[!ToRemove,]
###
dir()
r1 <- read.csv('hlth_cd_acdr2_1_Data.csv')
r1$Value <- as.character(r1$Value) %>%
gsub(',','',.) %>%
gsub(':','0',.) %>%
as.numeric(.)
r2 <- reshape(r1,direction='wide',
idvar=c('ICD10','ICD10_LABEL'),
v.names='Value',
timevar='GEO',
drop=c('UNIT','TIME','AGE','SEX','Flag.and.Footnotes'))
names(r2) <- gsub('Value.','',names(r2),fixed=TRUE)
r2 <- r2[!(r2$ICD10 %in%
c('A-R_V-Y','A-B','C','E','F','G-H','I','J','K','M','N','R','V01-Y89','ACC'))
,]
row.names(r2) <- r2$ICD10_LABEL
m1 <- as.matrix(r2[,-(1:2)]) %>% t(.)
m2 <- m1[rownames(m1) %in% EUN@data$NUTS_ID,]
pr1 <- princomp(m2,cor=TRUE)
#plot(pr1)
#biplot(pr1)
#plot(pr1$loadings[,1],pr1$loadings[,2])
myscale <-function(x)
(x-min(x))/(max(x)-min(x))
tom <- data.frame(
loc=as.character(rownames(pr1$scores)),
rgbr=myscale(pr1$scores[,1]),
rgbg=myscale(pr1$scores[,2]),
rgbb=myscale(pr1$scores[,3]))
tom$rgb <- with(tom,rgb(rgbr,rgbg,rgbb))
EUN@data = data.frame(EUN@data[,1:4], tom[
match(EUN@data[, "NUTS_ID"],tom[, "loc"]), ])
png('map.png')
par(mar=rep(0,4))
plot <- plot(EUN, col = EUN@data$rgb,
axes = FALSE, border = NA)
dev.off()
load <- as.data.frame(as.matrix(pr1$loadings[,1:3]))
load$name <- rownames(pr1$loadings)
load <- mutate(load,
rgbr=myscale(Comp.1),
rgbg=myscale(Comp.2),
rgbb=myscale(Comp.3),
rgb =rgb(rgbr,rgbg,rgbb))
range(load$Comp.1)
png('legend2.png',
width=960,
height=960,
res=144)
par(mar=rep(0,4))
with(load,plot(x=Comp.1,y=Comp.2,type='n',
xlim=range(Comp.1)*1.2))
sample_n(load,40) %>%
with(.,text(x=Comp.1,y=Comp.2,
labels=name,
col=rgb,cex=.5))
dev.off()
Sunday, January 4, 2015
Capital in the Netherlands, 2006-2013
I guess most people have heard of Piketty and his book Capital in the Twenty-First Century. I don't have that book, but the media attention has made me wonder whether I could see any trends in Dutch public data. As I progressed with this post, I concluded that these data cannot tell me much about longer term trends; however, one can see very well that most people have been getting poorer as the crisis continued.
Data
Data are acquired from Statistics Netherlands. They have Statline, the electronic databank of Statistics Netherlands; it enables users to compile their own tables and graphs, and the information can be accessed, printed and downloaded free of charge. There is both a Dutch and an English version of the website. Above I linked to the English part, but I could only find the vermogensklassen table in the Dutch language section. As far as I understand, estate is a better word than capital for 'vermogen'; it is intended to reflect the net value of all of a person's assets.
Since the contents of the tables are also in Dutch, I will be translating; however, it is not so easy, as not everything has an English language equivalent.
It should also be noted that these tables are not the most ideal in the 'Piketty context'. For one thing the time range is too short; for another, the detail in the upper capital range is too low.
Finally, there are three sets of variables within the tables: income categories (ten categories, low to high), income source and capital itself. There is one table which crosses income categories with income source; this table provides counts, means and quantiles. The other table has capital categories, but only by either income categories or income source.
After a suitable selection of data ranges I chose the .csv versions to export the data.
As a final note, the data for 2011 have two versions, before and after a method change. When using years in a display, the former got year 2010.9 and the latter 2011.1. The data for 2013 are preliminary.
Regarding income source, I translated as follows:
c('1 Working','1.1 Employees',
'1.2 Civil Servant','1.3 Other working',
'2 Own Business','3 Welfare',
'3.1 Income Insurance','3.1.1 Unemployment',
'3.1.2 Illness','3.1.3 Retirement',
'3.2 Social','3.2.1 Sustention','3.2.2 Other Social',
'3.3 Other welfare')
There is a structure: Working has three sub-categories, and Welfare has two levels. It should be explained that 3.1 and its sub-categories are paid for by employees while working. Unemployment has a limited duration; illness is, as far as I understand, a combination of short and long term (the rules have changed a bit, so I am not 100% sure). The category 3.2.1 is 'bijstand', which is the last resort, for which one can only apply if there are no other sources of income.
Extremes in the data
The presence of both a mean and a 75% quantile makes it possible to find subsections of the data where the mean is larger than the 75% quantile. Below is a selection of these data where the quantile is also larger than 100000, restricted to the years 2007 and 2013, and ordered by the ratio mean/75% quantile (the dplyr call used is shown right after the table).
income Source count year Mean p50
1 2e 10% 2 Own Business 35000 2013 152000 15000
2 3e 10% 1.3 Other working 4000 2013 203000 11000
3 4e 10% 1.3 Other working 5000 2013 239000 23000
4 1e 10% (low) 2 Own Business 92000 2013 281000 25000
5 10e 10% (high) 2 Own Business 221000 2013 816000 283000
6 10e 10% (high) 2 Own Business 205000 2007 891000 380000
7 1e 10% (low) 3.1.3 Retirement 83000 2013 207000 24000
8 2e 10% 1.3 Other working 4000 2013 257000 6000
9 1e 10% (low) 1.3 Other working 11000 2013 474000 6000
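The selection above was produced with a dplyr chain like the one below; the same call appears again in the Code section at the end of the post.
filter(r2, !is.na(Mean), Mean>p75, year %in% c(2007,2013), p75>100000) %>%
  arrange(., Mean/p75) %>%
  select(., income, Source, count, year, Mean, p50)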
We can observe that these extremes mostly occur in 2013.
There are no data with a mean lower than the 25% quantile; however, we can select all data with the 25% quantile lower than 0. These are mostly working people from the 6th to 8th income category in 2013. I would guess these people have a house which lost value while the mortgage did not. It should be noted that mid-2014 house prices did increase again, but since the data reflect the situation on the 1st of January, it will be some years before that is reflected in these tables. On the radio today I heard that currently one third of the houses are worth less than the mortgage on them.
income Source count year Mean p25 p50
1 3e 10% 1.2 Civil Servant 22000 2013.0 20000 -13000 2000
2 7e 10% 1.1 Employees 403000 2013.0 55000 -18000 9000
3 7e 10% 1 Working 484000 2013.0 63000 -18000 10000
4 7e 10% 1.2 Civil Servant 67000 2013.0 67000 -19000 18000
5 6e 10% 1.1 Employees 361000 2013.0 46000 -11000 5000
6 8e 10% 1.2 Civil Servant 88000 2013.0 82000 -16000 32000
7 6e 10% 1 Working 425000 2013.0 54000 -9000 6000
8 8e 10% 1.1 Employees 413000 2013.0 76000 -12000 24000
9 8e 10% 1 Working 520000 2013.0 85000 -13000 27000
10 7e 10% 1.2 Civil Servant 64000 2010.9 82000 -3000 30000
11 7e 10% 1.3 Other working 13000 2013.0 289000 -6000 89000
12 8e 10% 1.3 Other working 19000 2013.0 294000 -2000 114000
13 4e 10% 1.3 Other working 5000 2013.0 239000 -1000 23000
From these data at least it seems that 2013 has quite a few more extremes than 2007, both on the high and the low side.
Plots of trends
Using these same data it is also possible to make plots. I have chosen only the lowest and highest income categories, otherwise the sub-plots became too small.
In the plot the dots are the means, the lines the medians, and the bars the 25% and 75% quantiles.
In case anybody wonders why this welfare category is so rich: that is mostly the retired. I suppose these people have worked hard all their life, got a good retirement fund and bought a house of which the mortgage is paid off. I have no explanation for the low-income business owners, who still seem to be worth on average 250000 Euros; their mean estate is larger than that of the low-income business owners of the next income segment. They did not lose that much value either, in contrast to the 9th income group of working people, who have a lower mean value and are losing value.
Plots of estate categories
There are some difficulties in making these displays. The categories have wildly different widths, the lowest and highest categories are unbounded, and the 0 to 5000 Euro category is by far the most populous. The net effect is a figure with a high peak and long tails where the density is unclear and all detail is crowded together; hence in the end I dropped histograms.
The figure below shows trends for income categories while selecting the lowest estate categories. It is clear that especially the number of people with a clearly negative estate is growing. People are not entering the small negative category either; debtors in general owe more than 10000 Euro.
In the highest estate categories again most categories are getting empty. The one exception is the 1 million Euro plus category: the number of Euro millionaires increased in 2013. In this latter category the lowest income group is, somehow, relatively often present.
Categories by income source
Again in the lowest category we see a big increase in people with debt. The trend seems to be that the number of working people with debt increases quickly.
For the highest categories we see the corresponding effect. All categories go down, except the richest people. I have chosen to add the retirement category to this plot, so it is visible that in these categories it is mostly the retired who make up the welfare category.
Status in 2013
The final plot shows all estate categories for the three income sources. In this plot we see that for business owners the largest category is the debtors. For the working people the most frequent categories are 0 to 5000 Euro and more than 10 000 Euro debt.
We can also see a split. Besides the debtors there are the haves, with more than 100 000 Euros, and the have-a-little-bits, with less than 30 000 Euro.
Code
Part 1
This is the code used for the results based on the mean data.
library(ggplot2)
library(dplyr)
r1 <- readLines('Gemiddeld_vermogen___241214145457.csv')
r1[6303]
r1 <- r1[-6303]
c1 <- read.csv(text=r1,skip=1,sep=';',na.strings='.')
uu <- as.character(unique(c1[,1]))
c1$income <- factor(c1[,1],levels=uu)
c1$value <- as.numeric(as.character(c1$Waarde))*1000
c1$period <- factor(c1$Period,levels=levels(c1$Period)[c(1:5,7,6,8,9)])
names(c1)[3] <- 'Source'
levels(c1$Source) <- c('1 Working','1.1 Employees',
'1.2 Civil Servant','1.3 Other working',
'2 Own Business','3 Welfare',
'3.1 Income Insurance','3.1.1 Unemployment',
'3.1.2 Illness','3.1.3 Retirement',
'3.2 Social','3.2.1 Sustention','3.2.2 Other Social','3.3 Other welfare')
number <- c1[c1$Onderwerpen_1=='Huishoudens met vermogen',]
number$count <- number$value
number <- subset(number,,c(income,period,Source,count))
amount <- subset(c1,c1$Onderwerpen_1!='Huishoudens met vermogen')
amountw <- reshape(amount,
v.names=c('value'),
timevar='Onderwerpen_2',
varying=list(c('Mean','p25','p50','p75')),
idvar=c('income','period','Source'),
direction='wide',
drop=c('Onderwerpen_1','Inkomensgroepen',
'Perioden','Waarde.eenheid','Waarde'))
r2 <- merge(number,amountw)
r2$year <- as.numeric(substr(r2$period,1,4))
r2$year[r2$period=='2011 na methodewijziging'] <- 2011.1
r2$year[r2$period=='2011 voor methodewijziging'] <- 2010.9
levels(r2$income) <- gsub('inkomen','',levels(r2$income)) %>%
gsub('-groep','',.,fixed=TRUE) %>%
gsub('laag ','low',.) %>%
gsub('hoog ','high',.) %>%
gsub(' ()','',.,fixed=TRUE)
filter(r2,!is.na(Mean) , Mean>p75,year %in% c(2007,2013),p75>100000) %>%
arrange(.,Mean/p75) %>%
select(.,income, Source,count,year,Mean,p50)
filter(r2,!is.na(Mean) , Mean<p25) %>%
arrange(.,-Mean/p25) %>%
select(.,income, Source,count,year,Mean,p50)
filter(r2,!is.na(Mean) , p25<0) %>%
arrange(.,-Mean/p25) %>%
select(.,income, Source,count,year,Mean,p25,p50)
# first figure
g1 <- ggplot(r2[as.numeric(r2$Source) %in%
c(1,5,6) &
as.numeric(r2$income) %in% c(1,2,9,10),],
aes(x=year,
y=p50,
ymin=p25,
ymax=p75,
col=Source))
g1 + geom_errorbar(position = "dodge") +
geom_line()+
facet_wrap(~ income,nrow=2) +
guides(col=guide_legend(ncol=3)) +
theme(legend.position='bottom',
legend.direction ='vertical',
text=element_text(size=10,
)
) +
geom_point(aes(x=year,y=Mean,col=Source)) +ylab('Euro')
Part 2
This is the code using the estate categories. It starts with a load of data reorganisation, because the data sit in columns for income categories and income sources, with both of these having amounts and counts. It starts with a section for income categories (low to high), after which a similar section for income sources follows.
library(ggplot2)
library(dplyr)
cvk <- readLines('Vermogensklassen__hu_241214141933.csv')
cvk.head <- as.matrix(read.csv2(text=cvk[2:5],header=FALSE))
cvk.body <- read.csv2(text=cvk[5:(length(cvk)-1)])
keepcol <- c(colnames(cvk.head)[cvk.head[3,]=='Aantal' &
grepl('groep',cvk.head[1,])])
colnames(cvk.body)[c(1,2)] <- c(cvk.head[4,1:2])
mytimes <- cvk.head[1,keepcol]
cvk.body <- cvk.body[,c(1,2,which(cvk.head[3,]=='Aantal' &
grepl('groep',cvk.head[1,])))]
#head(cvk.body)
cvk <- reshape(cvk.body,
varying=list(names(cvk.body)[-2:-1]),
v.names='Count',
timevar='income',
idvar=c('Vermogensklassen','Perioden'),
direction='long',
times=mytimes
)
rownames(cvk) <- 1:nrow(cvk)
cvk$income <- factor(cvk$income,levels=unique(cvk$income))
levels(cvk$income) <- gsub('inkomen','',levels(cvk$income)) %>%
gsub('-groep','',.,fixed=TRUE) %>%
gsub('laag ','low',.) %>%
gsub('hoog ','high',.) %>%
gsub(' ()','',.,fixed=TRUE)
head(cvk)
cvk$Period <- factor(cvk$Perioden,levels=levels(cvk$Perioden)[c(1:5,7,6,8,9)])
cvk$year <- as.numeric(substr(cvk$Period,1,4))
cvk$year[cvk$Period=='2011 na methodewijziging'] <- 2011.1
cvk$year[cvk$Period=='2011 voor methodewijziging'] <- 2010.9
#linux
cvk$EstateCat <- factor(cvk$Vermogensklassen,
levels(cvk$Vermogensklassen)[c(1,14,4,2,3,5:13,16:24,15)])
#win
#cvk$EstateCat <- factor(cvk$Vermogensklassen,
# levels(cvk$Vermogensklassen)[c(3,1,2,4:14,16:24,15)])
levels(cvk$EstateCat) <- gsub('Vermogen','',levels(cvk$EstateCat)) %>%
#gsub('000','',.) %>%
gsub('tot','to',.) %>%
gsub('miljoen','million',.) %>%
gsub('en meer','plus',.) %>%
gsub('[[:space:]]+',' ',.)
#xtabs(~ EstateCat + Vermogensklassen,data=cvk)
############
# plotting phase of income categories
############
# low
p1 <- ggplot(cvk[as.numeric(cvk$EstateCat)<7,],aes(y=Count,x=year,col=income))
p1+geom_line()+
facet_wrap(~ EstateCat) +
guides(col=guide_legend(ncol=5)) +
theme(legend.position='bottom',
legend.direction ='vertical',
text=element_text(size=10,
)) +
ylab('Count (*1000) ')
dev.off()
# high
p1 <- ggplot(cvk[as.numeric(cvk$EstateCat)>18,],aes(y=Count,x=year,col=income))
p1+geom_line()+
facet_wrap(~ EstateCat) +
guides(col=guide_legend(ncol=5)) +
theme(legend.position='bottom',
legend.direction ='vertical',
text=element_text(size=10,
)) +
ylab('Count (*1000) ')
################
# income sources
################
#cvk.head <- as.matrix(read.csv2(text=cvk[2:5],header=FALSE))
cvk <- readLines('Vermogensklassen__hu_241214141933.csv')
cvk.body <- read.csv2(text=cvk[5:(length(cvk)-1)])
keepcol <- c(colnames(cvk.head)[cvk.head[3,]=='Aantal' &
!grepl('groep',cvk.head[1,])])
#keepcol
colnames(cvk.body)[c(1,2)] <- c(cvk.head[4,1:2])
mytimes <- cvk.head[1,keepcol]
cvk.body <- cvk.body[,c(1,2,which(cvk.head[3,]=='Aantal' &
!grepl('groep',cvk.head[1,])))]
#head(cvk.body)
cvit <- reshape(cvk.body,
varying=list(names(cvk.body)[-2:-1]),
v.names='Count',
timevar='Source',
idvar=c('Vermogensklassen','Perioden'),
direction='long',
times=mytimes
)
rownames(cvit) <- 1:nrow(cvit)
#head(cvit)
#unique(cvit$Source)
cvit$Source <- factor(cvit$Source)
levels(cvit$Source) <- c('1 Working','1.1 Employees',
'1.2 Civil Servant','1.3 Other working',
'2 Own Business','3 Welfare',
'3.1 Income Insurance','3.1.1 Unemployment',
'3.1.2 Illness','3.1.3 Retirement',
'3.2 Social','3.2.1 Sustention','3.2.2 Other Social','3.3 Other welfare')
#levels(cvit$Perioden)
cvit$Period <- factor(cvit$Perioden,levels=levels(cvit$Perioden)[c(1:5,7,6,8,9)])
cvit$year <- as.numeric(substr(cvit$Period,1,4))
cvit$year[cvit$Period=='2011 na methodewijziging'] <- 2011.1
cvit$year[cvit$Period=='2011 voor methodewijziging'] <- 2010.9
levels(cvit$Vermogensklassen)
cvit$EstateCat <- factor(cvit$Vermogensklassen,
levels(cvit$Vermogensklassen)[c(1,14,4,2,3,5:13,16:24,15)])
#levels(cvit$EstateCat)
#win
#cvit$EstateCat <- factor(cvit$Vermogensklassen,
# levels(cvit$Vermogensklassen)[c(3,1,2,4:14,16:24,15)])
levels(cvit$EstateCat) <- gsub('Vermogen','',levels(cvit$EstateCat)) %>%
#gsub('000','',.) %>%
gsub('tot','to',.) %>%
gsub('miljoen','million',.) %>%
gsub('en meer','plus',.) %>%
gsub('[[:space:]]+',' ',.)
# figure 5
p1 <- ggplot(cvit[as.numeric(cvit$Source) %in% c(1,5,6) &
as.numeric(cvit$EstateCat) <7 ,],
aes(y=Count,x=year,col=Source))
p1+geom_line()+
facet_wrap(~ EstateCat) +
guides(col=guide_legend(ncol=5)) +
theme(legend.position='bottom',
legend.direction ='vertical',
text=element_text(size=10,
)) +
ylab('Count (*1000) ')
# figure 6
p1 <- ggplot(cvit[as.numeric(cvit$Source) %in% c(1,5,6,10) &
as.numeric(cvit$EstateCat) >18 ,],
aes(y=Count,x=year,col=Source))
p1+geom_line()+
facet_wrap(~ EstateCat) +
guides(col=guide_legend(ncol=5)) +
theme(legend.position='bottom',
legend.direction ='vertical',
text=element_text(size=10,
)) +
ylab('Count (*1000) ')
# figure 2013
p1 <- ggplot(cvit[as.numeric(cvit$Source) %in% c(1,5, 6) & cvit$year==2013 ,],
aes(y=Count,x=EstateCat))
p1+geom_point()+
facet_wrap(~ Source) +
guides(col=guide_legend(ncol=4)) +
theme(legend.position='bottom',
legend.direction ='vertical',
text=element_text(size=10,
))+
ylab('Count (*1000) ') +
coord_flip()