Monday, August 27, 2012

Football (Eredivisie) goals

The football season has started in Netherlands, so I went and had a look at last year's scores. I did not find downloadable data, at I could copy last season's data and paste into a spreadsheet. The default layout did not look good to me for statistical analysis (first lines shown below), so first part is reformatting to get how many goals a club did against another. This means that each game becomes two lines and an indicator variable is used for who was playing at home.

Datum Thuisclub (home) Uitclub (away) Thuisscore Uitscore
2011-08-05 Excelsior Feyenoord 0 2
2011-08-06 RKC Waalwijk Heracles Almelo 2 2
2011-08-06 Roda JC FC Groningen 2 1

# read data
old <- read.csv('seizoen1112.csv',stringsAsFactors=FALSE)
# reformat lengthwise
StartData <- with(old,data.frame(
OffenseClub = c(Thuisclub,Uitclub),
DefenseClub = c(Uitclub,Thuisclub),
Goals = c(Thuisscore,Uitscore),
OffThuis = rep(c(1,0),each=length(Datum))
#remove space as first character in names
levels(StartData$OffenseClub) <- sub('^ ','',levels(StartData$OffenseClub))
levels(StartData$DefenseClub) <- sub('^ ','',levels(StartData$DefenseClub))
# check clubs have same name and ordening in both factors.

Distribution of goals

To get to know the data the distribution of goals is most interesting:
xt <- xtabs(~Goals ,data=StartData)
The first question is; is this approximately Poisson distributed? MASS has some functions.
(fd <- fitdistr(StartData$Goals, densfun="Poisson"))
plot(xt,ylim=c(0,200),ylab='Frequency of goals')
It is also possible to do the same calculations with the fitdistrplus package and get some extra information. Two different algorithms give essentially the same information.
(fd2 <- fitdist(StartData$Goals, distr='pois',method='mle'))
Fitting of the distribution ' pois ' by maximum likelihood 

       estimate Std. Error
lambda 1.629085 0.05159362

fitdist(StartData$Goals, distr='pois',method='mme')
Fitting of the distribution ' pois ' by matching moments 
lambda 1.629085
This is nice. I get the same information as my own plot, but also a CDF and have to do less. If only I knew all packages in R, I would not have to program anything. But there are more extra's in fitdistrplus. Goodness of fit and bootstrap confidence interval.
Chi-squared statistic:  18.46365 
Degree of freedom of the Chi-squared distribution:  4 
Chi-squared p-value:  0.001001433 
bd <- bootdist(fd2)
Parametric bootstrap medians and 95% percentile CI 
  Median     2.5%    97.5% 
1.627451 1.527803 1.728736 
Finally, if desired the same can be done the Bayesian way via Jags
calcData <- with(StartData,list(N=length(Goals),Goals=Goals))
mijnmodel <- function() {
  for (i in 1:N) {
    Goals[i] ~ dpois(mu)
  mu ~ dunif(0,3)
jags.inits <- function() {
      mu = runif(1,0,3)
jags(data=calcData, inits=jags.inits, jags.params,model.file='model.file')
Inference for Bugs model at "model.file", fit using jags,
 3 chains, each with 2000 iterations (first 1000 discarded)
 n.sims = 3000 iterations saved
         mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat n.eff
mu         1.633   0.051    1.533    1.598    1.632    1.667    1.732 1.001  2100
devianc 2048.287   1.338 2047.308 2047.403 2047.757 2048.671 2052.118 1.006  1500

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 0.9 and DIC = 2049.2
DIC is an estimate of expected predictive error (lower deviance is better).
This leads to essentially the same result as before.


Three ways to get parameters of a distribution are used. The most obvious way, via MASS gives parameters and standard deviation. A dedicated package gives all that plus significance and distribution. The Bayesian way gives the distribution too and uses most code by far. The estimates do not differ in any relevant way.

Tuesday, August 14, 2012

Random and fixed effects in sensory profiling

I am reading Introduction into mixed modelling by N.W. Galway. It is partly a repeat of things I know, but I expect to use mixed models quite a lot the coming time, so it is good to repeat these things.
My problem with this book is a sensory example in chapter 2. It is profiling data, but with some twists, as happens in real life. He also makes some odd choices.


The test concerns four hot products (ravioli). Because the products are hot, it was found not feasible to use a proper balanced design within each day (e.g. Williams designs). Within a day each presentation (he used presentation for rounds) has one product, as shown in the table below. Which is fine in a way, I do believe in making designs which be executed in practice. However, I come from a facility where we would at least have tried to solve this with the 'au-bain-marie'.

Table 1, allocation of products A to D over presentations and days.
Day Presentation 1 Presentation 2 Presentation 3 Presentation 4
1 B A C D
2 C D B A
3 A C B D

Within each presentation the nine assessors get served portions. It was randomized though not registered who got what serving. Even so, a random variable was created to serve as proxy. I understand this does not make a difference if servings are nested in presentations and days. Still, it removes the option of actually looking at this effect except as nested within servings and days. I can actually imagine that the product served first is different, especially in case of sloppy sensory practices, so it would be nice to be able to check this.


The following effects were used in the model(s)
Table 2. Allocation of effects
Effect Allocation
Day Random
Day.Presentation Random
Day.Presentation.Serving Random
Product Fixed
Assessor Fixed
Product.Assessor Fixed

Here I got my ax to grind.

  • Product.Assessor. To quote: '(assessor) ANA perceived Brand B as the least salty, whereas (assessor) GUI perceived Brand  A as the least salty. Such crossover effects may be important: they suggest an obstacle to designing a brand that will please all consumers'. In my (industry) experience the sensory panel may well be in a different country or even continent as the target market, so that is a bit too strong. Besides, nine assessors is a bit low to segment groups on. Typically segmenting would be done with a consumer group, say 120 persons, in the target market. In my view  sensory is intended to provide an objective measurement. The assessors are representing what a typical human may taste, and are hence random. With assessors random, product.assessor can only be found random too.
  • Day.Presentation. To quote: 'Presentation 1 on Day 1 is not the the same as presentation 1 on day 2' and 'it would therefore not be meaningful to obtain the mean for presentation 1 over days'. Actually, presentation 1 is the same on day 1 as on day 2 . It is the tasting with a clean palette. While palette cleansing is supposed to make all presentations equal, that does not mean it really does. After all, this is on the trade-off between production (shorter cleaning time) and correctness (longer cleaning time). Looking at this effect is important. I would make it fixed. For round 1 specifically, it represents monadic tasting as in a consumer test. Surely a first impression is important to understand consumer liking, when correlating consumer data and sensory data I might look at solely presentation 1. Having said that, given the design in table 1, I would make presentation random. The information is not there to do anything else. 


I dislike the design, the sensory foundation and interpretation. However, if you ignore that sensory based dislike, the model actually makes sense and is a nice example. On top of that, the book has R code using lme (nlme package), with code on the download site has both nlme and lme4 examples, so is up to date. I am looking forward to reading the next chapters.

Wednesday, August 1, 2012

Trying Julia

In my previous post I tried building Williams designs in R. Since that code was running a bit slow, this was an ideal test for Julia. Big enough to be at least slightly realistic, small enough that it is doable.
I am very impressed. Almost twenty fold speed increase, even though this was the best I could do in R, the most naive way possible in Julia.
                                       R              Julia        Ratio
Double Williams design 4, 100 times    3.97 sec       0.212 sec    0.053   
Williams design 5, 10 times          289.5 sec       18.54  sec    0.064 

I'd surely love to use Julia more. For instance, if I could port some of the algorithm's of Professor Ng Machine Learning class to Julia? I don't think it is possible yet, but what is not, may come.

Julia code

macro timeit(ex,name,num)
        t0 = 0
        for i=1:$num
            t0 = t0 + @elapsed $ex
        println("julia,", $name, ",", t0)

function gendesign(ncol)
  desmat = zeros(Uint8,nrow,ncol)
  desmat[1,1:ncol] = [1:ncol]
  for i = [1:ncol]
    desmat[2*i-1,1] = i
  carover = zeros(Uint8,ncol,ncol)
  for i = 1:(ncol-1)
    carover[i,i+1] = 1
  count   = 0
  addpoint( desmat , carover,count)

function numzero(matin) 
  length(matin) - nnz(matin) 
function first0(matin)
   if numzero(matin)==0
      return -1
   nrow, ncol = size(matin)
   for row = 1:nrow
      for col = 1:ncol
         if matin[row,col] == 0
            return row, col

function addpoint(desmat,carover,count)
  if nnz(desmat) == length(desmat)
     count +=1
     return count
  row,col  = first0(desmat)
  for i = [1:size(desmat,2)]
     if numzero(desmat[row,:]-i) == 0
        if numzero(desmat[:,col]-i) < 2
           if carover[desmat[row,col-1],i] < 2 
              if (col !=2) | (desmat[row,1] != desmat[row-1,1]) | (desmat[row-1,col] < i)
                 carover[desmat[row,col-1],i] +=1
                 count = addpoint(desmat,carover,count)
                 carover[desmat[row,col-1],i] -=1
   return count

@timeit gendesign(4) "design 4 " 100
@timeit gendesign(5) "design 5 " 10


Both designs were created using the same script. If you ask an even number of design points using the odd algorithm, then this will work. You just get a design with double the rows, carry over balanced, every treatment equally often in each row and each column.