## Sunday, March 13, 2016

### Happy PI day

I have never done a post for PI day. This year I want to do so.

So, we all know the simple estimation of PI based on random numbers. The code used here is chosen for speed in R.
pi2d <- function(N=1000) {
4*sum(rowSums(matrix(runif(N*2)^2,ncol=2))<1)/N
}
What irritates me, is the low efficiency of this estimate. What do you get for 10 000 simulations? Probably, but not even certain, the first two digits.
summary(sapply(1:1000,function(x) pi2d(10000)))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
3.080   3.130   3.141   3.141   3.153   3.189
In the past years I have been thinking how to get that more efficient, but that is not obvious. For instance, it is possible to use the three dimensional equivalent, a ball:
pi3d <- function(N=1000) {
6*sum(rowSums(matrix(runif(N*3)^2,ncol=3))<1)/N
}
summary(sapply(1:1000,function(x) pi3d(10000)))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
3.052   3.121   3.140   3.142   3.161   3.243
This is even worse, the variation is higher.

At some point I thought this is due to the limited information in such a calculation, it is binomial and one simulation gives one bit of information. And it could be more simple. If the first random number is known, say y, then all second random numbers over sqrt(1-y2) give distance larger than 1, while the remainder gives distance less than 1. Thus should pi be equal to the mean of random numbers transformed like sqrt(1-y2)?
pin <- function(N=1000) {
4*sum(sqrt(1-runif(N)^2))/N
}
summary(sapply(1:1000,function(x) pin(10000)))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
3.113   3.135   3.142   3.141   3.147   3.171
These numbers are closer, but there are additional calculations. Hence the number of simulations should be adapted to reflect the work done. Luckily we have microbenchmark() to calibrate this. After a bit of experimenting, these are the number of simulations giving roughly equivalent computation times.
microbenchmark(pi2d(10000),pi3d(6666),pin(22000))
Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
pi2d(10000) 2.419106 2.436333 2.630764 2.458325 2.500477 5.253860   100
pi3d(6666) 2.361928 2.382820 2.557150 2.418006 2.467855 4.970898   100
pin(22000) 2.448429 2.468954 2.555823 2.485815 2.517703 5.023678   100
As can be seen, the third calculation actually has more simulations. Hence it is much more efficient to obtain the estimate.
summary(sapply(1:100,function(x) pi2d(10000)))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
3.111   3.132   3.141   3.142   3.152   3.175
summary(sapply(1:100,function(x) pi3d(6666)))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
3.046   3.116   3.142   3.139   3.165   3.230
summary(sapply(1:100,function(x) pin(22000)))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
3.130   3.137   3.141   3.142   3.146   3.161
It could obviously be thought that the random numbers are not needed. An integration can be done. But that is much less fun.
integrate(function(x) 4*sqrt(1-x^2),0,1)
3.141593 with absolute error < 0.00016

## Sunday, February 14, 2016

### Confidence intervals for Proportions

Since I read documents with Clopper-Pearson a number of times the last weeks, I thought it a good idea to play around with confidence intervals for proportions a bit; to examine how intervals differ between various approaches. From a frequentist side Clopper-Pearson, which is described as the frequentist's gold standard and secondly the easy way normal approximation. From the Bayesian side, binomial with beta Beta prior. Obviously, the intervals have completely different interpretation in the frequentist and Bayesian framework, but that is a different discussion. There will be no data in this analysis, I am just making intervals based on possible results

### Code

There are many ways to set this up. I wanted some plots. My first approach; given an observed proportion of 'correct', how does the total of trials change the intervals? The second approach; given that a certain number of trials is done, how do the intervals change as the number correct changes?

Since I want to repeat many of these calculations, I first made some supporting functions. This is because I am trying to write more clear code, where as much as possible code is not repeated but rather delegated to some sort of function. That may not result in the shortest or fastest code, but at this point neither is required.

#### Intervals

The first functions create the intervals from n (observed) and N (total). Clopper-Pearson is extracted from binom.test(). Normal approximation is based on an internet example. Beta-Binomial has three functions, one for the actual work, two to set up the desired priors and adapt the naming. A final function calls all these.

clopper.pearson <- function(n,N,conf.level=0.95) {
limits <- as.numeric(binom.test(n,N,conf.level=conf.level)\$conf.int)
names(limits) <- c('cp_low','cp_high')
limits
}

binom.norm.app <- function(n,N,conf.level=0.95) {
# based on http://www.r-tutor.com/elementary-statistics/interval-estimation/interval-estimate-population-proportion
phat <- n/N
shat <- sqrt(phat*(1-phat)/N)
limit <- (1-conf.level)/2
zlim <- qnorm( c(limit,1-limit))*shat
limits <-  phat+zlim
names(limits) <- c('na_low','na_high')
limits
}

beta.binomial <- function(n,N,conf.level=0.95,prior=c(1,1)){
limit <- (1-conf.level)/2
limits <- qbeta(c(limit,1-limit),n+prior,N-n+prior)
names(limits) <- c('bb_low','bb_high')
limits
}
beta.binomial11 <- function(n,N,conf.level=0.95) {
limits <- beta.binomial(n,N,conf.level=conf.level,prior=c(1,1))
names(limits) <- c('bb11_low','bb11_high')
limits
}
beta.binomial.5.5 <- function(n,N,conf.level=0.95) {
limits <- beta.binomial(n,N,conf.level=conf.level,prior=c(0.5,0.5))
names(limits) <- c('bb.5_low','bb.5_high')
limits
}

all.intervals <- function(n,N,conf.level=0.95) {
c(n=n,N=N,conf.level=conf.level,
clopper.pearson(n,N,conf.level=conf.level),
binom.norm.app(n,N,conf.level=conf.level),
beta.binomial11(n,N,conf.level=conf.level),
beta.binomial.5.5(n,N,conf.level=conf.level))
}

#### Post processing

Just doing an sapply() on all.intervals() gives a matrix. It needs to be processed a bit to get a nice data.frame which ggplot likes. Hence a function in which it is transposed, reshaped and names of the intervals are split. Naming is adapted for display purposes.
postprocessing <- function(have1outN) {
have1outN <- as.data.frame(t(have1outN))
have1outN <- reshape(have1outN,
varying=list(names(have1outN)[-1:-3]),
idvar=c('n','N','conf.level'),
timevar='statistic',
times=names(have1outN)[-1:-3],
v.names='limit',
direction='long')

have1outN\$direction <-
sub('^.+_','',have1outN\$statistic)
have1outN\$Method <-
sub('_.+\$','',have1outN\$statistic)
have1outN\$Method[have1outN\$Method=='cp']<- 'Clopper Pearson'
have1outN\$Method[have1outN\$Method=='na']<-'Normal Approximation'
have1outN\$Method[have1outN\$Method=='bb.5']<- 'Beta Bionomial prior 0.5 0.5'
have1outN\$Method[have1outN\$Method=='bb11']<- 'Beta Bionomial prior 1 1'
have1outN
}

### Results

#### Results for a proportion correct

The codes are variations on this example for 50% correct. As most of the work is done in the supporting functions, there is no need to repeat the code:
have1outN <- sapply(1:20,function(x) all.intervals(1*x,2*x))
have1outN <- postprocessing(have1outN)
ggplot(have1outN,aes(x=limit,y=N,col=Method,l=direction)) +
geom_path() +
xlim(c(min(0,have1outN\$limit),max(1,have1outN\$limit))) +
ggtitle('Interval at 1/2 correct') +
theme(legend.position="bottom") +
guides(col=guide_legend(ncol=2))

It seems that especially at lower N the Normal approximation is not advisable. Having an interval stick outside the range 0-1 is obviously a dead giveaway that something is not correct. But even if that does not happen, the lines are pretty far of the remainder of the methods. The difference between the two Beta Binomials is surprisingly small and only visible when very few observations are made. Clopper-Pearson seems to give slightly wider intervals than Beta Binomial.

#### Results for a fixed N

Again, the code is variations on a theme, with the work being done by the supporting functions.

Again the normal approximation is the odd out. It also seems to degenerate at n=0 and n=N. Other than that the choice of prior in the Beta Binomial is more expressed that the previous plots.

## Tuesday, February 2, 2016

### Unemployment in Europe

A couple of years I have made plots of unemployment and its change over the years. At first this was a bigger and complex piece of code. As things have progressed, the code can now become pretty concise. There are just plenty of packages to do the heavy lifting. So, this year I tried to make the code easy to read and reasonably documented.

### Data

Data is from Eurostat. Since we have the joy of the Eurostat package, suffice to say this is dataset une_rt_m. Since the get_eurostat function gave me codes for things such as country and gender, the first step is to use a dictionary to decode. Subsequently, the country names are a bit sanitized and data is selected.
library(eurostat)
library(ggplot2)
library(KernSmooth)
library(plyr)
library(dplyr)

library(scales) # to access breaks/formatting functions

r1 <- get_eurostat('une_rt_m')%>%
mutate(.,geo=as.character(geo)) # character preferred for merge
r2 <- get_eurostat_dic('geo') %>%
rename(.,geo=V1) %>%
mutate(.,
# part of country name within braces removed
country=gsub('\\(.*\$','',V2),
country=gsub(' \$','',country),
country=ifelse(geo=='EA19',paste(country,'(19)'),country)) %>%
select(.,geo,country) %>%
right_join(.,r1) %>%
# keep only total, drop sexes
filter(.,sex=='T') %>%
# filter out old Euro area and keep only EU28 , EA19
filter(.,!grepl('EA..',geo)|  geo=='EA19') %>%
filter(.,!(geo %in% c('EU15','EU25','EU27')) ) %>%
# SA is seasonably adjusted
filter(.,s_adj=='SA') %>%
mutate(.,country=factor(country)) %>%
select(.,-sex,-s_adj)

### Plots

To make plots I want to have smoothed data. Ggplot will do this, but it is my preference to have the same smoothing for all curves, hence it is done before entering ggplot. There are a bit many countries, hence the number is reduced to 36, which are displayed in three plots of 3*4, for countries with low, middle and high maximum unemployment respectively. Two smoothers are applied, once for the smoothed data, the second for its first derivative. The derivative has forced more smooth, to avoid extreme fluctuation.

# add 3 categories for the 3 3*4 displays
r3 <- aggregate(r2\$values,by=list(geo=r2\$geo),FUN=max,na.rm=TRUE) %>%
mutate(.,class=cut(x,quantile(x,seq(0,3)/3),
include.lowest=TRUE,
labels=c('low','middle','high'))) %>%
select(.,-x) %>% # maxima not needed any more
right_join(.,r2)

#locpoly to make smooth same for all countries
Perc <- ddply(.data=r3,.variables=.(age,geo),
function(piece,...) {
piece <- piece[!is.na(piece\$values),]
lp <- locpoly(x=as.numeric(piece\$time),y=piece\$values,
drv=0,bandwidth=90)
sdf <- data.frame(Date=as.Date(lp\$x,origin='1970-01-01'),
sPerc=lp\$y,
age=piece\$age,
geo=piece\$geo,
country=piece\$country,
class=piece\$class)}
,.inform=FALSE
)

# locpoly for deriviative too
dPerc <- ddply(.data=r3,.variables=.(age,geo),
function(piece,...) {
piece <- piece[!is.na(piece\$values),]
lp <- locpoly(x=as.numeric(piece\$time),y=piece\$values,
drv=1,bandwidth=365/2)
sdf <- data.frame(Date=as.Date(lp\$x,origin='1970-01-01'),
dPerc=lp\$y,
age=piece\$age,
geo=piece\$geo,
country=piece\$country,
class=piece\$class)}
,.inform=FALSE
)
The plots are processed by subsection.
for (i in c('low','middle','high')) {
png(paste(i,'.png',sep=''))
g <- filter(Perc,class==i) %>%
ggplot(.,
aes(x=Date,y=sPerc,colour=age)) +
facet_wrap( ~ country, drop=TRUE) +
geom_line()  +
theme(legend.position = "bottom")+
ylab('% Unemployment') + xlab('Year') +
scale_x_date(breaks = date_breaks("5 years"),
labels = date_format("%y"))
print(g)
dev.off()
}
for (i in c('low','middle','high')) {
png(paste('d',i,'.png',sep=''))
g <- filter(dPerc,class==i) %>%
ggplot(.,
aes(x=Date,y=dPerc,colour=age)) +
facet_wrap( ~ country, drop=TRUE) +
geom_line()  +
theme(legend.position = "bottom")+
ylab('Change in % Unemployment') + xlab('Year')+
scale_x_date(breaks = date_breaks("5 years"),
labels = date_format("%y"))
print(g)
dev.off()
}

### Results

In general, things are improving, which is good news, though there is still ways to go. As always, Eurostat has a nice document are certainly more knowledgeable than me on this topic.

## Sunday, January 17, 2016

### A simple ANOVA

I was browsing Davies Design and Analysis of Industrial Experiments (second edition, 1967). Published by for ICI in times when industry did that kind of thing. It is quite an applied book. On page 107 there is an example where the variance of a process is estimated.

### Data

Data is from nine batches from which three samples were selected (A, B and C) and each a duplicate measurement. I am not sure about copyright of these data, so I will not reprint the data here. The problem is to determine the measurement ans sampling error in a chemical process.
ggplot(r4,aes(x=Sample,y=x))+
geom_point()+
facet_wrap(~  batch )

### Analysis

At the time of writing the book, the only approach was to do a classical ANOVA and calculate the estimates from there.
aov(x~ batch + batch:Sample,data=r4) %>%
anova
Analysis of Variance Table

Response: x
Df Sum Sq Mean Sq  F value  Pr(>F)
batch         8 792.88  99.110 132.6710 < 2e-16 ***
batch:Sample 18  25.30   1.406   1.8818 0.06675 .
Residuals    27  20.17   0.747
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In this case the residual variation is 0.75. The batch:Sample variation estimates is, due to the design, twice the sapling variation plus residual variation. Hence it is estimated as 0.33. How lucky we are to have tools (lme4) which can do this estimate directly. In this case, as it was a well designed experiment, these estimates are the same as from the ANOVA.
l1 <- lmer(x ~1+ (1 | batch) + (1|batch:Sample) ,data=r4 )

summary(l1)
Linear mixed model fit by REML ['lmerMod']
Formula: x ~ 1 + (1 | batch) + (1 | batch:Sample)
Data: r4

REML criterion at convergence: 189.4

Scaled residuals:
Min       1Q   Median       3Q      Max
-1.64833 -0.50283 -0.06649  0.55039  1.57801

Random effects:
Groups       Name        Variance Std.Dev.
batch:Sample (Intercept)  0.3294  0.5739
batch        (Intercept) 16.2841  4.0354
Residual                  0.7470  0.8643
Number of obs: 54, groups: batch:Sample, 27; batch, 9

Fixed effects:
Estimate Std. Error t value

(Intercept)   47.148      1.355    34.8
A next step is confidence intervals around the estimates. Davies uses limits from a Chi-squared distribution for the residual variation, leading to a 90% interval 0.505  to 1.25. In contrast lme4 has two estimators, profile (computing a likelihood profile and finding the appropriate cutoffs based on the likelihood ratio test;and bootstrap (perform parametric bootstrapping with confidence intervals computed from the bootstrap distribution according to boot.type). Each of these takes one or few second on my laptop, not feasible for the pre computer age. The estimates are different, to my surprise more narrow:
Computing profile confidence intervals ...
5 %       95 %
.sig01       0.0000000  0.9623748
.sig02       2.6742109  5.9597328
.sigma       0.7017849  1.1007261
(Intercept) 44.8789739 49.4173227

Computing bootstrap confidence intervals ...
5 %       95 %
sd_(Intercept)|batch:Sample  0.000000  0.8880414
sd_(Intercept)|batch         2.203608  5.7998348
sigma                        0.664149  1.0430984

(Intercept)                 45.140652 49.4931109
Davies continues to estimate the ratio to residual for sampling variation, which was the best available for that time. This I won't repeat.

## Sunday, January 3, 2016

### A plot of 'Who works at home'

I ran across this post containing displays on who works from home. I must say it looks great and is interactive but it did not help me understand the data. So I created this post to display the same data with a boring plot which might help me. For those really interested in this topic, census.gov created a .pdf which contains a full report with much more information than here.

### Data

Data is from census.gov. I have taken the first spreadsheet. It is one of those spreadsheets with counts and percentages and empty lines to display categories. Very nice to check some numbers, horrible to process. So, a bit of code to extract the numbers.
library(gdata)
r1 <- read.xls('2010_Table_1.xls',stringsAsFactors=FALSE)
# throw out percentages
r2 <- r1[,r1[4,]!='Percent']
# put all column names in one row
r2\$X.6 <- r2\$X.6
r2\$X.8 <- r2\$X.8
# select part with data
r3 <- r2[2:61,c(1,3,5,6)]
names(r3) <- r3[1,1]
r4 <-r3[c(-1:-3),]
#eliminate one row with mean income.
r4 <- r4[-grep('\$',r4[,2],fixed=TRUE),]
#reshape in long form
r5 <- reshape(r4,
varying=list(names(r4)[-1]),
v.names='count',
direction='long',
idvar='Characteristic',
timevar='class',
times=r3[1,2:4])
row.names(r5) <- 1:nrow(r5)
# remove ',' from numbers and make numerical values.
# units are in 1000, so update that too
r5\$count <- as.numeric(gsub(',','',r5\$count))*1000
# clean up numbers used for footnotes
r5\$class <- gsub('(1|2|3)','',r5\$class)
#some upfront '.' removed.
r5\$Characteristic <- gsub('^\\.+','',r5\$Characteristic)
# create a factor
r5\$Characteristic <- factor(r5\$Characteristic,
levels=rev(r5\$Characteristic[r5\$class=='Home Workers']))
# and create a higher level factor
r5\$Mchar=r5\$Characteristic
for (i in 1:nrow(r5)) r5\$Mchar[i] <-
if(is.na(r5\$count[i]) | r5\$Mchar[i]=='Total') r5\$Mchar[i]
else r5\$Mchar[i-1]

### Plot

The plot is made using old style graphics. I could not get either ggplot2 or lattice to provide the plot I wanted.
# prepare for axis labels
index <- subset(r5,r5\$class=='Home Workers',c(Characteristic,Mchar))
index\$y=56:1
index2 <- index[index\$Characteristic!=index\$Mchar | index\$Characteristic=='Total',]
index3 <- index[index\$Characteristic==index\$Mchar & index\$Characteristic!='Total',]

r6 <- merge(r5,index)
r6\$class <- factor(r6\$class)
par(mar=c(5,18,4,2)+.1,cex=.7)
plot(x=r6\$count,y=r6\$y,axes=FALSE,
xlab='Count',
ylab='',
col=c('red','green','blue')[r6\$class],
frame.plot=TRUE,
#   log='x',
ylim=c(2,58))
axis(1)
axis(2,at=index2\$y,labels=index2\$Characteristic,las=1)
text(y=index3\$y-.1,x=30000,labels=index3\$Characteristic,adj=0)
legend('topleft',legend=levels(r6\$class),
ncol=3,col=c('red','green','blue'),
border=NULL,pch=1,
yjust=0)

#### Why I did not use ggplot2?

The ideal solution for ggplot2 might look something like this:
r7 <- r5[!is.na(r5\$count),]
r7\$Mchar <- factor(r7\$Mchar,levels=unique(r7\$Mchar))
ggplot(data=r7,
aes(x=Characteristic,y=count,col=class)) +
geom_point()+
coord_flip()+
xlab('')+ylab('')+
ylim(0,max(r5\$count))+
facet_wrap(~Mchar,scales='free_x',ncol=2)+
theme(legend.position="bottom")
However, this throws an error:
Error in facet_render.wrap(plot\$facet, panel, plot\$coordinates, theme,  :
ggplot2 does not currently support free scales with a non-cartesian coord or coord_flip.
I also tried the system described here: http://wresch.github.io/2014/05/22/aligning-ggplot2-graphs.html, but I think width has changed in content, could not get that to be satisfactory.

library(gtable)
library(gridExtra)

tt <- as.data.frame(table(r7\$Mchar))
tt\$Var1
tt\$Freq <- tt\$Freq +15

la <- lapply(tt\$Var1,function(x) {
r8 <- r5[r5\$Mchar==as.character(x) ,]
r8 <- r8[ !is.na(r8\$count),]
ggplot(data=r8,
aes(x=Characteristic,y=count,col=class)) +
geom_point()+
coord_flip()+
xlab('')+ylab('')+
ylim(0,max(r5\$count))
})

# http://wresch.github.io/2014/05/22/aligning-ggplot2-graphs.html
lax <- lapply(la,function(x) x\$widths[2:3])
maxwidths <- do.call(grid::unit.pmax,lax)
for(i in 1:12) la[[i]]\$widths <- as.list(maxwidths)

la[] <- la[] +
theme(legend.position="bottom",
plot.margin = unit(c(0.01, 0.1, 0.02, 0.1), "null"))
for (i in 1:11) la[[i]] <- la[[i]] +
theme(legend.position="none",
axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
plot.margin = unit(c(0.01, 0.1, 0.02, 0.1), "null"))

lag <- lapply(la,ggplotGrob)

g <- gtable_matrix(name = "demo",
grobs = matrix(lag, nrow = 12),
widths = unit(9, "null"),
heights = unit(tt\$Freq, "null"))

grid.newpage()
grid.draw(g)