Wiekvoet: August 2012

Monday, August 27, 2012

Football (Eredivisie) goals

The football season has started in Netherlands, so I went and had a look at last year's scores. I did not find downloadable data, at http://www.eredivisiestats.nl/wedstrijden.php I could copy last season's data and paste into a spreadsheet. The default layout did not look good to me for statistical analysis (first lines shown below), so first part is reformatting to get how many goals a club did against another. This means that each game becomes two lines and an indicator variable is used for who was playing at home.

Datum	Thuisclub (home)	Uitclub (away)	Thuisscore	Uitscore
2011-08-05	Excelsior	Feyenoord	0	2
2011-08-06	RKC Waalwijk	Heracles Almelo	2	2
2011-08-06	Roda JC	FC Groningen	2	1

# read data
old <- read.csv('seizoen1112.csv',stringsAsFactors=FALSE)
# reformat lengthwise
StartData <- with(old,data.frame(
OffenseClub = c(Thuisclub,Uitclub),
DefenseClub = c(Uitclub,Thuisclub),
Goals = c(Thuisscore,Uitscore),
OffThuis = rep(c(1,0),each=length(Datum))
))
#remove space as first character in names
levels(StartData$OffenseClub) <- sub('^ ','',levels(StartData$OffenseClub))
levels(StartData$DefenseClub) <- sub('^ ','',levels(StartData$DefenseClub))
# check clubs have same name and ordening in both factors.
all(levels(StartData$OffenseClub)==levels(StartData$DefenseClub))

Distribution of goals

To get to know the data the distribution of goals is most interesting:

xt <- xtabs(~Goals ,data=StartData)

plot(xt)

The first question is; is this approximately Poisson distributed? MASS has some functions.

library(MASS)

(fd <- fitdistr(StartData$Goals, densfun="Poisson"))

     lambda  

  1.62908497 

 (0.05159364)

round(dpois(0:8,coef(fd))*sum(xt))

plot(xt,ylim=c(0,200),ylab='Frequency of goals')

points(x=0:8,y=dpois(0:8,coef(fd))*sum(xt),col='green')

It is also possible to do the same calculations with the fitdistrplus package and get some extra information. Two different algorithms give essentially the same information.

library(fitdistrplus)

(fd2 <- fitdist(StartData$Goals, distr='pois',method='mle'))

Fitting of the distribution ' pois ' by maximum likelihood

Parameters:

estimate Std. Error

lambda 1.629085 0.05159362

fitdist(StartData$Goals, distr='pois',method='mme')

Fitting of the distribution ' pois ' by matching moments

Parameters:

estimate

lambda 1.629085

plot(fd2)

This is nice. I get the same information as my own plot, but also a CDF and have to do less. If only I knew all packages in R, I would not have to program anything. But there are more extra's in fitdistrplus. Goodness of fit and bootstrap confidence interval.

gofstat(fd2,print.test=TRUE)

Chi-squared statistic: 18.46365

Degree of freedom of the Chi-squared distribution: 4

Chi-squared p-value: 0.001001433

bd <- bootdist(fd2)

summary(bd)

Parametric bootstrap medians and 95% percentile CI

Median 2.5% 97.5%

1.627451 1.527803 1.728736

plot(density(bd$estim$lambda))

Finally, if desired the same can be done the Bayesian way via Jags

library(R2jags)

calcData <- with(StartData,list(N=length(Goals),Goals=Goals))

mijnmodel <- function() {

for (i in 1:N) {

Goals[i] ~ dpois(mu)

}

mu ~ dunif(0,3)

}

write.model(mijnmodel,'model.file')

jags.inits <- function() {

list(

mu = runif(1,0,3)

)

}

jags.params=c('mu')

jags(data=calcData, inits=jags.inits, jags.params,model.file='model.file')

Inference for Bugs model at "model.file", fit using jags,

3 chains, each with 2000 iterations (first 1000 discarded)

n.sims = 3000 iterations saved

mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff

mu 1.633 0.051 1.533 1.598 1.632 1.667 1.732 1.001 2100

devianc 2048.287 1.338 2047.308 2047.403 2047.757 2048.671 2052.118 1.006 1500

For each parameter, n.eff is a crude measure of effective sample size,

and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)

pD = 0.9 and DIC = 2049.2

DIC is an estimate of expected predictive error (lower deviance is better).

This leads to essentially the same result as before.

Conclusion

Three ways to get parameters of a distribution are used. The most obvious way, via MASS gives parameters and standard deviation. A dedicated package gives all that plus significance and distribution. The Bayesian way gives the distribution too and uses most code by far. The estimates do not differ in any relevant way.

Tuesday, August 14, 2012

Random and fixed effects in sensory profiling

I am reading Introduction into mixed modelling by N.W. Galway. It is partly a repeat of things I know, but I expect to use mixed models quite a lot the coming time, so it is good to repeat these things.
My problem with this book is a sensory example in chapter 2. It is profiling data, but with some twists, as happens in real life. He also makes some odd choices.

Design

The test concerns four hot products (ravioli). Because the products are hot, it was found not feasible to use a proper balanced design within each day (e.g. Williams designs). Within a day each presentation (he used presentation for rounds) has one product, as shown in the table below. Which is fine in a way, I do believe in making designs which be executed in practice. However, I come from a facility where we would at least have tried to solve this with the 'au-bain-marie'.

Table 1, allocation of products A to D over presentations and days.

Day	Presentation 1	Presentation 2	Presentation 3	Presentation 4
1	B	A	C	D
2	C	D	B	A
3	A	C	B	D

Within each presentation the nine assessors get served portions. It was randomized though not registered who got what serving. Even so, a random variable was created to serve as proxy. I understand this does not make a difference if servings are nested in presentations and days. Still, it removes the option of actually looking at this effect except as nested within servings and days. I can actually imagine that the product served first is different, especially in case of sloppy sensory practices, so it would be nice to be able to check this.

Model

The following effects were used in the model(s)
Table 2. Allocation of effects

Effect	Allocation
Day	Random
Day.Presentation	Random
Day.Presentation.Serving	Random
Product	Fixed
Assessor	Fixed
Product.Assessor	Fixed

Here I got my ax to grind.

Product.Assessor. To quote: '(assessor) ANA perceived Brand B as the least salty, whereas (assessor) GUI perceived Brand A as the least salty. Such crossover effects may be important: they suggest an obstacle to designing a brand that will please all consumers'. In my (industry) experience the sensory panel may well be in a different country or even continent as the target market, so that is a bit too strong. Besides, nine assessors is a bit low to segment groups on. Typically segmenting would be done with a consumer group, say 120 persons, in the target market. In my view sensory is intended to provide an objective measurement. The assessors are representing what a typical human may taste, and are hence random. With assessors random, product.assessor can only be found random too.
Day.Presentation. To quote: 'Presentation 1 on Day 1 is not the the same as presentation 1 on day 2' and 'it would therefore not be meaningful to obtain the mean for presentation 1 over days'. Actually, presentation 1 is the same on day 1 as on day 2 . It is the tasting with a clean palette. While palette cleansing is supposed to make all presentations equal, that does not mean it really does. After all, this is on the trade-off between production (shorter cleaning time) and correctness (longer cleaning time). Looking at this effect is important. I would make it fixed. For round 1 specifically, it represents monadic tasting as in a consumer test. Surely a first impression is important to understand consumer liking, when correlating consumer data and sensory data I might look at solely presentation 1. Having said that, given the design in table 1, I would make presentation random. The information is not there to do anything else.

Conclusion

I dislike the design, the sensory foundation and interpretation. However, if you ignore that sensory based dislike, the model actually makes sense and is a nice example. On top of that, the book has R code using lme (nlme package), with code on the download site http://www.wiley.com/legacy/wileychi/mixed-modelling/ has both nlme and lme4 examples, so is up to date. I am looking forward to reading the next chapters.

Wednesday, August 1, 2012

Trying Julia

In my previous post I tried building Williams designs in R. Since that code was running a bit slow, this was an ideal test for Julia. Big enough to be at least slightly realistic, small enough that it is doable.
I am very impressed. Almost twenty fold speed increase, even though this was the best I could do in R, the most naive way possible in Julia.
R Julia Ratio
Double Williams design 4, 100 times 3.97 sec 0.212 sec 0.053
Williams design 5, 10 times 289.5 sec 18.54 sec 0.064

I'd surely love to use Julia more. For instance, if I could port some of the algorithm's of Professor Ng Machine Learning class to Julia? I don't think it is possible yet, but what is not, may come.

Julia code

macro timeit(ex,name,num)

quote

t0 = 0

for i=1:$num

t0 = t0 + @elapsed $ex

end

println("julia,", $name, ",", t0)

#gc()

end

function gendesign(ncol)

nrow=ncol*2

desmat = zeros(Uint8,nrow,ncol)

desmat[1,1:ncol] = [1:ncol]

for i = [1:ncol]

desmat[2*i-1,1] = i

desmat[2*i,1]=i

end

carover = zeros(Uint8,ncol,ncol)

for i = 1:(ncol-1)

carover[i,i+1] = 1

end

count = 0

addpoint( desmat , carover,count)

end

function numzero(matin)

length(matin) - nnz(matin)

end

function first0(matin)

if numzero(matin)==0

return -1

end

nrow, ncol = size(matin)

for row = 1:nrow

for col = 1:ncol

if matin[row,col] == 0

return row, col

end

function addpoint(desmat,carover,count)

if nnz(desmat) == length(desmat)

count +=1

print("x")

return count

end

row,col = first0(desmat)

for i = [1:size(desmat,2)]

if numzero(desmat[row,:]-i) == 0

if numzero(desmat[:,col]-i) < 2

if carover[desmat[row,col-1],i] < 2

if (col !=2) | (desmat[row,1] != desmat[row-1,1]) | (desmat[row-1,col] < i)

desmat[row,col]=i

carover[desmat[row,col-1],i] +=1

count = addpoint(desmat,carover,count)

desmat[row,col]=0

carover[desmat[row,col-1],i] -=1

end

return count

end

@timeit gendesign(4) "design 4 " 100

@timeit gendesign(5) "design 5 " 10

Note

Both designs were created using the same script. If you ask an even number of design points using the odd algorithm, then this will work. You just get a design with double the rows, carry over balanced, every treatment equally often in each row and each column.