Wiekvoet: January 2014

Sunday, January 26, 2014

Converting plots to data II

There were quite some reactions on lasts week's post converting plots to data. Two additional programs were mentioned; WebPlotDigitzier and DataThief. Note that google or wikipedia has a bunch of alternatives. Datathief is shareware, a java program, so can run on Mac, Windows and Linux. WebPlotDigitizer is, as one might guess, web based. Since these two were mentioned, I thought to try them.

DataThief

DataThief works well. It is shareware, let me wait for 6 seconds before running. I found the axis definition non intuitive, the points are basically on the screen, you just have to shift them to the correct locations. It is just with the high resolution figure they were all top left. Then with the low resolution figure they were outside my screen. The points you want to add you drag from a small floating menu to the correct location. Or shift again if you are not right at once. You want to delete them, drag them back. That is, if you switch on the data dump menu item. For importing the data in R, I found it most easy to remove the top line from the .csv. For practical usage, it is close to PlotDigitizer which I examined last week.

WebPlotDigitizer

I found it again easy to use. Define axis, add points, remove points. Export to .csv. A zoom window so you can see in detail where you are. It has an auto mode which I like, but have to conclude that my current figure is not suitable.
In a corporate world web based is a winning thing. IT and self installed software is a no go area. Web based avoids the problem. I do not know if that extends to IE6. There is a Chrome app, which should work. I used Firefox, which does work. If I need to digitize a plot at the office, WebPlotDigitizer will be my tool of choice.

Results

Results will very much depend on the precision with which axis are defined, which can cause a bias. From practical point of view, when I combined all data, they were all on top of each other. The plot below has all four sets, using red, green, blue and black. Except for top-right, where quite of top of each other. Obviously as one comment stated, if the original plot contains an error, so does any extraction.

Saturday, January 18, 2014

Converting plots to data

It is a problem which occurs ever so often in applied work, you have a plot, but you want the data. There are at least two programs which can help you there; PlotDigitizer and Engauge Digitizer. I got both on my openSuse machine. Both are available for Windows, for Mac there are only older versions of Engauge.

I tried these programs on a relatively simple problem. I saw a plot in a book and wanted to calculate that line myself. So I took my camera, photographed the plot and got to work.

Engauge Digitizer

Engauge has been there for quite a while. It is many features, but looks a bit outdated. It was not able to import my original figure (2992*2992 pixels, 694 KB) but had no problems after resizing to 500*500 pixels, 55.9 KB.
It is clearly the program which can handle more exotic plots. For me it is not intuitive. For instance, it took me quite some time to figure out how to export the results. Initially I copied-pasted the results to a spreadsheet, later I managed to create a .csv after all. Engauge comes with a manual so everything can be resolved. Engauge has the ability to do point detection, to use that it is probably best to crop the figure as much as possible, Engauge has no qualms finding points in text, black blobs, axis labels and such. Probably in a colored plot automatic detection would work better, you have some settings to guide it.

PlotDigitizer

PlotDigitizer looks much more modern. It had no problems with the large photo, except that it could not scale that photo enough to fit on the screen. The modern interface allows manual adding/removing/moving of points. There is also a possibility to trace a line on screen and it will add points it detects there. PlotDigitizer exports to .xml. It is also possible to cipy-paste the results. While I see the advantage of a file including documentation, it would also be nice to get the data out of the file.

The file I got needed some extra processing before I had the data.frame.
library(XML)
mytree <- xmlTreeParse('test12.xml')
mylist <- xmlToList(mytree)
mylist2 <- mylist[4:length(mylist)]
mydf <- do.call(rbind,mylist2)
convert <- data.frame(x=as.numeric(mydf[,'dx']),
y=as.numeric(mydf[,'dy']))

Conclusion

The programs complement each other. Engauge is great for automated extraction, complex plots. However, it is not so easy for occasional usage. PlotDigitizer is easy to use, great if you want to manually select your points.

Saturday, January 11, 2014

Converting a JAGS model to STAN

For my first experience with STAN I wanted to convert my last JAGS program into STAN. This was a bit more difficult than I expected. The JAGS program was Fe concentration in rainwater including values below detection level.

Data

Data has been explained before. The only difference is that by now I am reading the data using read.xls from the gdata package. The new code is at the bottom of this post.

Converting the code

Initially I thought the conversion would be simple. I made some simplification and the code worked. However, it appeared my initial code was not very fast. Then I added the out of range data and that resulted in a problem seen before, out of range initialized data and hence no MCMC samples. After muddying along for a bit my second step was to make a simple regression model and get to know STAN a bit better. This knowledge set in a simple model. Extended it till the model had the features which I wanted except the out of range data. To add the out of range data I decided to initialize the more complex model using estimates from a more simple model. This worked well, the first code also serves as a kind of warm-up too. There are some warnings the way I set it up, but that is only in the first model, which is only run for a few cycles. Final running time was a disappointing close to two hours. This could probably be reduced a bit using less but longer chains.

learned

STAN is not so similar to JAGS that you can drop in code, it is better to start a program fresh than try a quick convert
Using normal(0,1) and a multiplication factor f gives faster code than using normal(0,f). It says so in the manual and yes it does make a difference
Having aliased parameters slows things down
STAN's compilation step makes for slower development than JAGS but it is still best to start small and extend the model

at some point I just worked on code with few samples and two chains, which helped a bit.

the STAN code does not feel faster than JAGS code

Run results

output during fitting first model

TRANSLATING MODEL 'Fe_code1' FROM Stan CODE TO C++ CODE NOW.
DIAGNOSTIC(S) FROM PARSER:

Warning (non-fatal): sampling statement (~) contains a transformed parameter or local variable.
You must increment lp__ with the log absolute determinant of the Jacobian of the transform.
Sampling Statement left-hand-side expression:
          LAmountC ~ normal_log(...)

COMPILING THE C++ CODE FOR MODEL 'Fe_code1' NOW.
SAMPLING FOR MODEL 'Fe_code1' NOW (CHAIN 1).

Informational Message: The current Metropolis proposal is about to be rejected becuase of the following issue:
Error in function stan::prob::normal_log(N4stan5agrad3varE): Scale parameter[0] is 0:0, but must be > 0!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Iteration: 1 / 75 [ 1%] (Warmup)
Informational Message: The current Metropolis proposal is about to be rejected becuase of the following issue:
Error in function stan::prob::normal_log(N4stan5agrad3varE): Location parameter[0] is inf:0, but must be finite!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Iteration: 7 / 75 [ 9%] (Warmup)
Iteration: 14 / 75 [ 18%] (Warmup)
Iteration: 21 / 75 [ 28%] (Warmup)
Iteration: 28 / 75 [ 37%] (Warmup)
Iteration: 35 / 75 [ 46%] (Warmup)
Iteration: 42 / 75 [ 56%] (Warmup)
Iteration: 49 / 75 [ 65%] (Warmup)
Iteration: 56 / 75 [ 74%] (Sampling)
Iteration: 63 / 75 [ 84%] (Sampling)
Iteration: 70 / 75 [ 93%] (Sampling)
Iteration: 75 / 75 [100%] (Sampling)
Elapsed Time: 55.0709 seconds (Warm-up)
              56.8683 seconds (Sampling)
              111.939 seconds (Total)

[Etc chains 2 to 4]

fit1

Inference for Stan model: Fe_code1.
4 chains, each with iter=75; warmup=50; thin=1;
post-warmup draws per chain=25, total post-warmup draws=100.

             mean se_mean sd   2.5%    25%    50%    75% 97.5% n_eff Rhat
intercept     0.0     0.0 0.1   -0.1    0.0    0.0    0.1    0.1    46 1.1
rate         -0.4     0.0 0.0   -0.5   -0.4   -0.4   -0.4   -0.3    45 1.1
FLoc_r[1]    -0.2     0.0 0.3   -0.8   -0.5   -0.2    0.0    0.4    58 1.1
FLoc_r[2]     1.4     0.1 0.4    0.8    1.1    1.3    1.6    2.2    25 1.1
FLoc_r[3]    -0.3     0.1 0.3   -0.9   -0.5   -0.3   -0.1    0.2    37 1.1
.
...output lines deleted
.
lp__       4194.4     0.9 4.4 4186.6 4190.8 4194.9 4197.5 4202.4    26 1.2

Samples were drawn using NUTS(diag_e) at Thu Jan 9 19:54:38 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Running second model

TRANSLATING MODEL 'Fe_code2' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'Fe_code2' NOW.
SAMPLING FOR MODEL 'Fe_code2' NOW (CHAIN 1).

Iteration:   1 / 500 [ 0%] (Warmup)
Informational Message: The current Metropolis proposal is about to be rejected becuase of the following issue:
Error in function stan::prob::normal_log(N4stan5agrad3varE): Location parameter[0] is inf:0, but must be finite!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Iteration: 50 / 500 [ 10%] (Warmup)
Iteration: 100 / 500 [ 20%] (Warmup)
Iteration: 150 / 500 [ 30%] (Sampling)
Iteration: 200 / 500 [ 40%] (Sampling)
Iteration: 250 / 500 [ 50%] (Sampling)
Iteration: 300 / 500 [ 60%] (Sampling)
Iteration: 350 / 500 [ 70%] (Sampling)
Iteration: 400 / 500 [ 80%] (Sampling)
Iteration: 450 / 500 [ 90%] (Sampling)
Iteration: 500 / 500 [100%] (Sampling)
Elapsed Time: 322.58 seconds (Warm-up)
              1212.62 seconds (Sampling)
              1535.2 seconds (Total)

[etc chains 2 to 4]

fit2

Inference for Stan model: Fe_code2.
4 chains, each with iter=500; warmup=100; thin=1;
post-warmup draws per chain=400, total post-warmup draws=1600.

             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
intercept     0.2     0.0 0.1     0.1     0.2     0.2     0.3     0.4   562
yearlydec     0.9     0.0 0.0     0.9     0.9     0.9     0.9     0.9   601
sigmaF[1]     0.5     0.0 0.0     0.5     0.5     0.5     0.5     0.5   394
.
...output lines deleted
.
lp__      -1836.8     3.6 56.4 -1944.1 -1876.3 -1830.9 -1802.3 -1723.9   245
          Rhat
intercept 1.0
yearlydec 1.0
sigmaF[1] 1.0
.
.output lines deleted
.
lp__       1.0

Samples were drawn using NUTS(diag_e) at Thu Jan 9 21:32:53 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Code

Reading data

# http://www.lml.rivm.nl/data/gevalideerd/data/lmre_1992-2005.xls

# http://www.r-bloggers.com/read-excel-files-from-r/

library(gdata)

readsheet <- function(cursheet,fname) {

df = read.xls(fname,

sheet = cursheet ,

blank.lines.skip=FALSE,

colClasses = "character")

topline <- 8

test <- which(df[6,]=='C')+1

numin <- df[topline:nrow(df),test]

names(numin) <- make.names( gsub(' ','',df[6,test]))

for(i in 1:ncol(numin)) numin[,i] <- as.numeric(gsub(',','.',numin[,i]))

status = df[topline:nrow(df),test-1]

names(status) <- paste('C',make.names( gsub(' ','',df[6,test])),sep='')

numin$StartDate <- as.Date(substr(df[topline:nrow(df),1],1,10))

numin$EndDate <- as.Date(substr(df[topline:nrow(df),2],1,10))

numin <- cbind(numin,status)

numin <- numin[!is.na(numin$StartDate),]

tall <- reshape(numin,

varying=list(

make.names(gsub(' ','',df[6,test])),

paste('C',

make.names(gsub(' ','',df[6,test])),sep='')),

v.names=c('Amount','Status'),

timevar='Compound',

idvar=c('StartDate','EndDate'),

times=paste(df[6,test],'(',df[5,test],')',sep=''),

direction='long')

tall$Compound <- factor(gsub(' )',')',gsub(' +',' ',tall$Compound)))

row.names(tall)<- NULL

tall$Location <- cursheet

tall

}

readfile <- function(fname) {

sheets <- sheetNames(fname)[-1]

tt <- lapply(sheets,readsheet,fname=fname)

tt2 <- do.call(rbind,tt)

tt2$Location <- factor(tt2$Location)

tt2$fname <- fname

tt2

}

fnames <- dir(pattern='*.xls') #.\\. ?

rf <- lapply(fnames,readfile)

rf2 <- do.call(rbind,rf)

rf2$machine <- factor(ifelse(rf2$fname=="lmre_1992-2005.xls",

'Van Essen/ECN vanger',

' Eigenbrodt NSA 181 KHT'))

Select Fe data and prepare

Fe <- rf2[rf2$Compound==levels(rf2$Compound)[9],]

Fe$LAmount <- log10(Fe$Amount)

Fe$LAmount[Fe$Amount<0.4] <- NA

Fe2 <- Fe[!is.na(Fe$Status),]

Fe3 <- Fe2[!is.na(Fe2$LAmount),]

Fe4 <- Fe2[is.na(Fe2$LAmount),]

library(rstan)

datain <- list(

LAmount = Fe3$LAmount,

Machine=(Fe3$machine=='Van Essen/ECN vanger')+1,

Location=(1:nlevels(Fe3$Location))[Fe3$Location],

time=as.numeric(Fe3$StartDate),

nobs =nrow(Fe3),

nloc=nlevels(Fe3$Location),

MachineC=(Fe4$machine=='Van Essen/ECN vanger')+1,

LocationC=(1:nlevels(Fe4$Location))[Fe4$Location],

timeC=as.numeric(Fe4$StartDate),

nobsC =nrow(Fe4),

L=log10(0.399)

)

First model

parameters1 <- c('intercept','rate','FLoc_r',

'sigmaF','sfmach','sfloc','FMach_r' )

Fe_code1 <- '

data {

int<lower=0> nloc;

int<lower=0> nobs;

vector[nobs] LAmount;

int<lower=1,upper=2> Machine[nobs];

int<lower=1,upper=nloc> Location[nobs];

vector[nobs] time;

int<lower=0> nobsC;

real <upper=min(LAmount)> L;

int<lower=1,upper=2> MachineC[nobsC];

int<lower=1,upper=nloc> LocationC[nobsC];

vector[nobsC] timeC;

}

parameters {

real intercept;

real rate;

vector [nloc] FLoc_r;

vector<lower=0> [2] sigmaF;

real<lower=0> sfmach;

real<lower=0> sfloc;

vector[2] FMach_r;

}

transformed parameters {

vector[nobs] ExpLAmount;

vector<lower=0>[nobs] sigma;

real mean_FLoc;

real mean_FMach;

vector[nloc] FLoc;

vector[2] FMach;

vector[nobsC] ExpLAmountC;

vector<lower=0> [nobsC] sigmaC;

vector[nobsC] LAmountC;

mean_FLoc <- mean(FLoc_r);

FLoc <- (FLoc_r-mean_FLoc)*sfloc;

mean_FMach <- mean(FMach_r);

FMach <- (FMach_r-mean_FMach)*sfmach;

for (i in 1:nobs) {

ExpLAmount[i] <- intercept

+ .0001*rate*time[i]

+ FMach[Machine[i]] +

+ FLoc[Location[i]];

sigma[i] <- sigmaF[Machine[i]];

}

for (i in 1:nobsC) {

ExpLAmountC[i] <- intercept

+ .0001*rate*timeC[i]

+ FMach[MachineC[i]] +

+ FLoc[LocationC[i]];

sigmaC[i] <- sigmaF[MachineC[i]];

LAmountC[i] <- log10(.2);

}

model {

sfmach ~ cauchy(0,4);

sfloc ~ cauchy(0,4);

FMach_r ~ normal(0,1);

FLoc_r ~ normal(0,1);

sigmaF ~ cauchy(0,4);

LAmount ~ normal(ExpLAmount,sigma);

LAmountC ~ normal(ExpLAmountC,sigmaC);

rate ~ normal(0,1);

}

generated quantities {

real dailydec;

real yearlydec;

dailydec <- pow(10,rate*.0001);

yearlydec <- pow(dailydec,365);

}

fit1 <- stan(model_code = Fe_code1,

data = datain,

pars=parameters1,

warmup=50,

iter = 75,

chains = 4)

fit1

Second model

samples <- as.array(fit1)

lastsample <- dim(samples)[1]

nchain <- dim(samples)[2]

init2 <- lapply(1:nchain,function(i) {

fin <- samples[lastsample,i,]

list(

intercept=fin[1],

rate=fin[2],

FLoc_r=fin[3:(3+16)],

sigmaF=fin[20:21],

sfmach=fin[22],

sfloc=fin[23],

FMach_r=fin[24:25],

LAmountC=rep(log10(.2),datain$nobsC)

)

})

parameters2 <- c('intercept','yearlydec',

'sigmaF','FMach','sfmach','sfloc',

'FLoc' ,'rate' )

Fe_code2 <- '

data {

int<lower=0> nloc;

int<lower=0> nobs;

vector[nobs] LAmount;

int<lower=1,upper=2> Machine[nobs];

int<lower=1,upper=nloc> Location[nobs];

vector[nobs] time;

int<lower=0> nobsC;

real <upper=min(LAmount)> L;

int<lower=1,upper=2> MachineC[nobsC];

int<lower=1,upper=nloc> LocationC[nobsC];

vector[nobsC] timeC;

}

parameters {

real intercept;

real rate;

vector [nloc] FLoc_r;

vector<lower=0> [2] sigmaF;

real<lower=0> sfmach;

real<lower=0> sfloc;

vector[2] FMach_r;

vector<upper=L>[nobsC] LAmountC;

}

transformed parameters {

vector[nobs] ExpLAmount;

vector<lower=0>[nobs] sigma;

real mean_FLoc;

real mean_FMach;

vector[nloc] FLoc;

vector[2] FMach;

vector[nobsC] ExpLAmountC;

vector<lower=0> [nobsC] sigmaC;

mean_FLoc <- mean(FLoc_r);

FLoc <- (FLoc_r-mean_FLoc)*sfloc;

mean_FMach <- mean(FMach_r);

FMach <- (FMach_r-mean_FMach)*sfmach;

for (i in 1:nobs) {

ExpLAmount[i] <- intercept

+ .0001*rate*time[i]

+ FMach[Machine[i]] +

+ FLoc[Location[i]];

sigma[i] <- sigmaF[Machine[i]];

}

for (i in 1:nobsC) {

ExpLAmountC[i] <- intercept

+ .0001*rate*timeC[i]

+ FMach[MachineC[i]] +

+ FLoc[LocationC[i]];

sigmaC[i] <- sigmaF[MachineC[i]];

}

model {

sfmach ~ cauchy(0,4);

sfloc ~ cauchy(0,4);

FMach_r ~ normal(0,1);

FLoc_r ~ normal(0,1);

sigmaF ~ cauchy(0,4);

LAmount ~ normal(ExpLAmount,sigma);

LAmountC ~ normal(ExpLAmountC,sigmaC);

rate ~ normal(0,1);

}

generated quantities {

real dailydec;

real yearlydec;

dailydec <- pow(10,rate*.0001);

yearlydec <- pow(dailydec,365);

}

fit2 <- stan(model_code = Fe_code2,

data = datain,

pars=parameters2,

init=init2,

warmup=100,

iter = 500,

chains = 4)

fit2

Saturday, January 4, 2014

Le Monde puzzle 847 in Julia

This week I wanted to play around with Julia and exporting the results. I found http://xianblog.wordpress.com/2013/12/29/le-monde-puzzle-847/ to be just the right size to play around with.

Code

A function to check if a triplet has the desired property
In [1]:

function lemonde847(xx)
   a=[xx[1],2]
   b=[xx[2],8]
   c=[xx[3],4]
  sum(kron(c,kron(a,b)))
end

Out[1]:

lemonde847 (generic function with 1 method)

Just a check - the function does indeed result in 1768 for the triplet 6 5 13.
In [2]:

lemonde847([6,5,13])

Out[2]:

Check all combinations of numbers 1 to 60 excluding 2, 4 and 8. These need to be permuted later on. 60 may be a bit much, but how to calculate a bound? After all, imagine a set 1 3 60, that is only 1428. On the other hand, a combination with 59 and 60 should be too large. Hence we make a filter to exclude some. The rows where all three test numbers multiplied are higher than 1768. The remaining combinations are permuted and tested.
In [3]:

lemonde847([1,60,3])

Out[3]:

In [4]:

totest = collect(combinations( symdiff(1:50, [2,4,8]),3))

Out[4]:

16215-element Array{Array{Int64,1},1}:
 [1,3,5]   
 [1,3,6]   
 [1,3,7]   
 [1,3,9]   
 [1,3,10]  
 [1,3,11]  
 [1,3,12]  
 [1,3,13]  
 [1,3,14]  
 [1,3,15]  
 [1,3,16]  
 [1,3,17]  
 [1,3,18]  
 ⋮         
 [45,48,50]
 [45,49,50]
 [46,47,48]
 [46,47,49]
 [46,47,50]
 [46,48,49]
 [46,48,50]
 [46,49,50]
 [47,48,49]
 [47,48,50]
 [47,49,50]
 [48,49,50]

In [5]:

for i in 1:size(totest)[1] 
    if prod(totest[i,])<1768
        test2 = collect(permutations(totest[i,])) 
        for j in 1:6 
            if lemonde847(test2[j,])==1768
               print(test2[j,]')
            end
        end
    end
end
6       5       13

Seems I found the answer; 6 5 13. The only thing left is to export the results. A bit of googling showed http://ipython.org/ipython-doc/stable/interactive/nbconvert.html.

Final Notes

Conversion

In the end I converted the script via:
ipython nbconvert lemonde847.ipynb
The resulting file was opened in libreoffice and copied pasted into blogger. This gave a bit better results than conversion using
ipython nbconvert --template basic lemonde847.ipynb
and a copy paste of the .html directly in blogger.
Some post editing was done after examination of the preview of the post, but that is fairly normal for me.

Time usage

The calculation time was minimal. Most of the time was used in understanding Julia syntax, debugging, writing code, writing blog text, installation of pandoc. A lot of these things are one-off, but still, this last sentence took more time than the computation time.