Wiekvoet

Sunday, October 28, 2012

Mixed distribution

In the football data there was some reason to use a mixed distribution in the football data (ref) so I tried doing that. It was more difficult than I expected. Not only is a mixture of distributions fairly difficult, also the system was over parameterized, causing grief.

mixed distribution

data from a mixed distribution is simply said data which is from a combination of distributions. Most obvious example is length of people; males are longer than females. Combine their lengths and a joint non normal distribution appears. Separate them, two normal distributions. This problem is equivalent to the eyes problem in winbugs. Except that I use JAGS and it was supposed to be part of a larger model. The example section in JAGS shows two ways how to do this. In the end I chose to use the dnormmix distribution in the mix module. To quote the JAGS manual: 'The mix module defines a novel distribution dnormmix(mu,tau,pi) representing a finite mixture of normal distributions' and 'If you want to use the dnormmix distribution but do not care about label switching, then you can disable the tempered transition sampler with
set factory "mix::TemperedMix" off, type(sampler)'
That is me, so here is how to do that in R:

library(R2jags)

Loading required package: coda

Loading required package: lattice

Loading required package: R2WinBUGS

Loading required package: rjags

linking to JAGS 3.3.0

module basemod loaded

module bugs loaded

Loading required package: abind

Loading required package: parallel

Attaching package: 'R2jags'

The following object(s) are masked from 'package:coda':

traceplot

load.module("mix")

module mix loaded

set.factory("mix::TemperedMix", 'sampler', FALSE)

NULL

list.factories('sampler')

factory status

1 mix::TemperedMix FALSE

2 bugs::DSum TRUE

3 bugs::Conjugate TRUE

4 bugs::Dirichlet TRUE

5 bugs::MNormal TRUE

6 base::Finite TRUE

7 base::Slice TRUE

football model

This is the previous model with dnormmix grafted in and overall strength with attack/defense removed in an attempt to simplify and understand the estimation problems . Note that Astr[1] is fixed to 1, Dstr[1] is almost fixed to 0 and all other AStr and DStr are free. I am not 100% sure about those parts. Astr needs to be fixed because otherwise the whole set of Astr[] and Dstr[] changes in a similar way over the MCMC samples. It is their relative numbers that count. Somehow, there is still a need to fix Dstr[1], as there is still an overall change. Yet, I have the feeling fixing Astr[1] gives information to determine Dstr[2:nclub] hence Astr[2:nclub] hence Dstr[1]. Yet, they all move similar, and Dstr[1] is clearly not 0 as suggested by the prior.

fbmodel1 <- function() {

for (i in 1:N) {

HomeMadeGoals[i] ~ dpois(HS[i])

OutMadeGoals[i] ~ dpois(OS[i])

log(HS[i]) <- Home + Dstr[OutClub[i]] + Astr[HomeClub[i]]

log(OS[i]) <- Dstr[HomeClub[i]] + Astr[OutClub[i]]

}

Astr[1] <- 1

Dstr[1] ~ dnorm(0,10)

for (i in 2:nclub) {

Astr[i] ~ dnormmix(MAStr,tauAStr1,EtaAStr1)

Dstr[i] ~ dnormmix(MDStr,tauDStr1,EtaDStr1)

}

for (i in 1:3) {

MAStr[i] ~ dnorm(0,.01)

MDStr[i] ~ dnorm(0,.01)

tauDStr1[i] <- tauDStr

tauAStr1[i] <- tauAStr

eee[i] <- 3

}

EtaAStr1[1:3] ~ ddirch(eee[1:3])

EtaDStr1[1:3] ~ ddirch(eee[1:3])

sigmaAstr <- 1/sqrt(tauAStr)

tauAStr ~ dgamma(.001,.001)

sigmaDstr <- 1/sqrt(tauDStr)

tauDStr ~ dgamma(.001,.001)

Home ~ dnorm(0,.0001)

}

params <- c("Dstr","Astr","sigmaAstr","sigmaDstr","Home")

inits <- function(){

list(TopStr=rnorm(JAGSData$nclub),

AD=rnorm(JAGSData$nclub),

sigmaAD=runif(1),

sigmaStr=runif(0,.5),

Home=rnorm(1)

)

}

jagsfit <- jags(JAGSData, model=fbmodel1, inits=inits,

parameters=params,progress.bar="gui",

n.iter=5000)

jagsfit

Inference for Bugs model at "C:/Users/.../RtmpSGssTD/modeld94a9b3849.txt", fit using jags,

3 chains, each with 15000 iterations (first 7500 discarded), n.thin = 7

n.sims = 3216 iterations saved

mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff

Astr[1] 1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1

Astr[2] 1.641 0.160 1.338 1.533 1.639 1.748 1.971 1.033 69

Astr[3] 1.328 0.196 0.943 1.196 1.324 1.463 1.710 1.023 91

Astr[4] 0.878 0.185 0.513 0.753 0.880 1.007 1.228 1.021 120

Astr[5] 0.753 0.207 0.326 0.617 0.755 0.894 1.143 1.017 140

Astr[6] 0.949 0.177 0.598 0.836 0.950 1.065 1.304 1.021 110

Astr[7] 1.559 0.158 1.257 1.450 1.559 1.669 1.875 1.032 73

Astr[8] 1.167 0.193 0.804 1.034 1.161 1.293 1.566 1.022 120

Astr[9] 1.426 0.180 1.078 1.304 1.429 1.551 1.756 1.029 84

Astr[10] 1.117 0.187 0.764 0.990 1.114 1.237 1.499 1.024 99

Astr[11] 1.006 0.179 0.660 0.889 1.003 1.121 1.368 1.024 98

Astr[12] 0.955 0.181 0.603 0.830 0.957 1.073 1.312 1.024 91

Astr[13] 1.599 0.159 1.296 1.490 1.599 1.712 1.919 1.034 71

Astr[14] 0.927 0.179 0.566 0.809 0.928 1.053 1.277 1.025 90

Astr[15] 1.178 0.195 0.810 1.046 1.174 1.308 1.574 1.016 160

Astr[16] 1.539 0.164 1.216 1.429 1.538 1.656 1.854 1.034 70

Astr[17] 1.040 0.183 0.693 0.917 1.039 1.154 1.414 1.019 130

Astr[18] 0.968 0.180 0.614 0.849 0.971 1.090 1.315 1.029 91

Dstr[1] -0.640 0.162 -0.977 -0.747 -0.638 -0.530 -0.331 1.031 79

Dstr[2] -1.183 0.179 -1.533 -1.301 -1.183 -1.062 -0.844 1.034 66

Dstr[3] -1.203 0.181 -1.568 -1.323 -1.196 -1.083 -0.853 1.023 97

Dstr[4] -0.709 0.160 -1.011 -0.814 -0.710 -0.604 -0.400 1.040 58

Dstr[5] -0.697 0.162 -1.020 -0.804 -0.697 -0.591 -0.375 1.050 46

Dstr[6] -0.840 0.177 -1.208 -0.956 -0.833 -0.719 -0.508 1.031 70

Dstr[7] -1.054 0.189 -1.413 -1.191 -1.055 -0.927 -0.675 1.039 62

Dstr[8] -0.869 0.181 -1.237 -0.987 -0.862 -0.743 -0.538 1.023 91

Dstr[9] -1.177 0.177 -1.533 -1.293 -1.178 -1.059 -0.832 1.022 100

Dstr[10] -0.821 0.169 -1.174 -0.929 -0.812 -0.704 -0.503 1.022 98

Dstr[11] -0.944 0.188 -1.307 -1.074 -0.935 -0.812 -0.593 1.023 95

Dstr[12] -1.092 0.177 -1.433 -1.216 -1.092 -0.972 -0.739 1.024 87

Dstr[13] -1.030 0.188 -1.386 -1.162 -1.036 -0.900 -0.650 1.023 100

Dstr[14] -1.031 0.189 -1.384 -1.160 -1.028 -0.899 -0.665 1.035 64

Dstr[15] -0.735 0.164 -1.065 -0.843 -0.734 -0.624 -0.418 1.042 54

Dstr[16] -0.836 0.178 -1.212 -0.952 -0.830 -0.714 -0.500 1.025 88

Dstr[17] -1.116 0.178 -1.456 -1.237 -1.121 -0.998 -0.758 1.042 55

Dstr[18] -0.681 0.163 -1.005 -0.791 -0.681 -0.571 -0.367 1.041 54

Home 0.334 0.062 0.210 0.293 0.335 0.376 0.453 1.012 170

sigmaAstr 0.185 0.103 0.042 0.102 0.169 0.254 0.410 1.014 220

sigmaDstr 0.135 0.079 0.028 0.067 0.123 0.190 0.301 1.033 68

deviance 1889.263 8.560 1873.823 1883.430 1888.635 1894.484 1907.580 1.001 2400

For each parameter, n.eff is a crude measure of effective sample size,

and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)

pD = 36.6 and DIC = 1925.9

DIC is an estimate of expected predictive error (lower deviance is better).

Sunday, October 14, 2012

Putting a football model into JAGS

In this post the football model is programmed into JAGS. There are all the reasons to do so. Jags 3.3 is recently released, I was stimulated by Gianluca's post . Obviously I could copy the model in his paper, but that would be too easy and not sure about copyright either. So, in this post, variations of the model of an earlier post is recreated.

Data structure

The starting data looks like this:

head(Jold)

Seizoen Datum Thuisclub Uitclub Thuisscore Uitscore

1 2011-2012 2011-08-05 Excelsior Feyenoord 0 2

2 2011-2012 2011-08-06 RKC Waalwijk Heracles Almelo 2 2

3 2011-2012 2011-08-06 Roda JC FC Groningen 2 1

4 2011-2012 2011-08-06 SC Heerenveen NEC 2 2

5 2011-2012 2011-08-06 VVV-Venlo FC Utrecht 0 0

6 2011-2012 2011-08-07 ADO Den Haag Vitesse 0 0

The data has to be made ready to be placed into JAGS. All factors have to be made numbers. Note that each game is still one row. The two outcomes will be resolved in JAGS

JAGSData <- with(Jold,list(

  N=length(Datum),

  nclub = nlevels(Thuisclub),

  HomeClub=(1:nlevels(Thuisclub))[Thuisclub],

  OutClub=(1:nlevels(Uitclub))[Uitclub],

  HomeMadeGoals=Thuisscore,

  OutMadeGoals=Uitscore))

str(JAGSData)

List of 6

 $ N            : int 306

 $ nclub        : int 18

 $ HomeClub     : int [1:306] 5 14 15 16 18 1 3 4 11 7 ...

 $ OutClub      : int [1:306] 9 10 6 12 8 17 13 2 7 3 ...

 $ HomeMadeGoals: int [1:306] 0 2 2 2 0 0 3 1 0 2 ...

 $ OutMadeGoals : int [1:306] 2 2 1 2 0 0 1 4 1 0 ...

Model

The model in JAGS has to contain all the details and all distributions. That makes it much longer, but also gives the opportunity to fiddle around with details. For now it is reasonably simple. Each club has two properties, an attack strength (Astr) and a defense strength (Dstr). Based on the combination of these strengths is the number of goals, again Poisson distributed. The interesting part is in the way attack and defense strength are constructed. These are created from an overall strength (TopStr) and an attack/defense balance (AD).
fbmodel1 <- function() {
for (i in 1:N) {
HomeMadeGoals[i] ~ dpois(HS[i])
OutMadeGoals[i] ~ dpois(OS[i])
log(HS[i]) <- Home + Astr[HomeClub[i]] + Dstr[OutClub[i]]
log(OS[i]) <- Dstr[HomeClub[i]] + Astr[OutClub[i]]
}
for (i in 1:nclub) {
Astr[i] <- TopStr[i]+AD[i]
Dstr[i] <- TopStr[i]-AD[i]
TopStr[i] ~ dnorm(0,tauTop)
AD[i] ~ dnorm(0,tauAD)
}
tauAD <- pow(sigmaAD,-2)
sigmaAD ~ dunif(0,100)
tauTop <- pow(sigmaTop,-2)
sigmaTop ~ dunif(0,100)
Home ~ dnorm(0,.0001)
}

To make this model work it needs a series on incantations in R. To keep things short only a basic output plot is given

params <- c("TopStr","AD","sigmaAD","sigmaTop","Home")
inits <- function(){
list(TopStr=rnorm(JAGSData$nclub),
AD=rnorm(JAGSData$nclub),
sigmaAD=runif(1),
sigmaTop=runif(0,.5),
Home=rnorm(1)
)
}
jagsfit <- jags(JAGSData, model=fbmodel1, inits=inits,
parameters=params,progress.bar="gui",
n.iter=4000)
plot(jagsfit)

Parameter extraction

As the interest is in the teams, their properties are extracted. As Gianluca wrote we considered a slightly more complex structure in which we included information on each team's propensity to be "good", "average", or "poor". This helped avoid overshrinkage in the estimations To examine this I looked at the teams relative scores.The plots look like they have a shoulder, so there is a point there.

jags.smat <- jagsfit$BUGSoutput$summary

for (i in 1:nlevels(Jold$Thuisclub))

rownames(jags.smat) <- sub(paste('[',i,']',sep=''),levels(Jold$Thuisclub)[i],

rownames(jags.smat),fixed=TRUE)

plot(density(jags.smat[grep('^Top',rownames(jags.smat)),1])

,main='Distribution of Strength')

points(x=jags.smat[grep('^Top',rownames(jags.smat)),1],

y=rep(0,nlevels(Jold$Thuisclub)))

plot(density(jags.smat[grep('^AD',rownames(jags.smat)),1]),

main='Distribution of Attack / Defense')

points(x=jags.smat[ grep('^AD',rownames(jags.smat)),1],

y=rep(0,nlevels(Jold$Thuisclub)))

model variation

There are a number of ways to create distributions for attack and defense strength (Astr and Dstr). One variation is to make them independent, but also to give them fatter tails, which is handled by a t distribution rather than a normal distribution. This fatter tail could be accommodate teams much better or worse than the majority. Unfortunately this did not work, the number of degrees of freedom for the t distribution varied a lot and was too high to make this an alternative model.

# section replaced in fbmodel1

for (i in 1:nclub) {

Astr[i] ~ dt(0,tauStr,nuStr)

Dstr[i] ~ dt(0,tauStr,nuStr)

}

tauStr <- pow(sigmaStr,-2)

sigmaStr ~ dunif(0,100)

nuStr <- 1/InuStr

InuStr ~ dunif(0,.5)

Sunday, October 7, 2012

Footbal ordinal model: examination and predictions

In the previous entry an ordinal model for football games was developed. It is now time to look a bit better at the model and use it. This means three sections; A look at likelihood and link function, a model interpretation part, which focuses on the effect of playing away or at home and a look at some predictions.

model definition

Just to recall the model defined.

clm4b <- clm(oGoals ~OffenseClub + DefenseClub*OffThuis, data=StartData)

link function and likelihood

Looking at the vignette 'Analysis of ordinal data with cumulative link models — estimation with the R-package ordinal' there is a section where the link function is examined. I am not the expert here, but logit, probit and cloglog are most often used. Logit and probit are similar, except for extreme results. cloglog and loglog are asymetrical. Cauchit, I never encountered in the literature, but since it's there it is taken along. A model with logit link is a proportional odds model, a model with cloglog link is a proportional hazard model. See also McCullagh and Nelder (2nd edition, 1989).

The example in the vignette ran thus:

links <- c("logit", "probit", "cloglog", "loglog", "cauchit")

sapply(links, function(link) {

clm(oGoals ~OffenseClub + DefenseClub*OffThuis

,link=link,data=StartData)$logLik })

logit probit cloglog loglog cauchit

-906.3591 -906.9762 -914.2396 -915.1181 -924.0992

Luckily the proportional odds model holds.

Slice can be used to plot the behavior of the likelihood as function of parameter estimates. It does give one plot for each parameter. To keep this post short only the first 9 are shown. These include the estimates of the category thresholds. It can be seen that thresholds 5|6 and 6|7 are asymmetrical, but close to the maximum likelihood value they are close to quadratic (detail plot not shown).

slice.fm4b <- slice(clm4b, lambda = 5)
par(mfrow = c(3,3))
plot(slice.fm4b,1:9)

Parameter interpretation

The most interesting parameter to assess is the interaction; DefenseClub*OffThuis. It is not simple to do so, all parameters are dependent on each other and the odd parameter is not present to keep the model estimable. As a way out the difference for each team between playing at home or away is used. Ideally this is against a similar team, so. The chances for a team when playing against itself are used. It is a very abstract idea; a team which plays at home against itself when away. The merit is in the interpretation.
library(lattice)
teams <- data.frame(Off=levels(StartData$OffenseClub),
Def=levels(StartData$OffenseClub))
homeaway <- morepred(clm4b,teams)
longha <- reshape(homeaway[,-1],
idvar='club2',varying=list(chance=names(homeaway)[3:5]),
direction='long',v.names='chance',
times=c('Home','Draw','Away'),timevar='Location')
dotplot(club2 ~ chance,groups=Location,data=longha,
auto.key=list(columns=3,space='bottom'),
xlim=c(0,1),xlab='Chance',
main='Chance of a club to win from itself, depending on home or away')

As can be seen, VVV-Venlo has a lot of advantage playing at home, as do Heracles Almelo, AZ and ADO Den Haag (these all have a black dot far right and a blue dot far left). In contrast, SC Heerenveen, FC Utrecht and De Graafschap show better chances away than at home. This I don't believe, if this were a Bayesian model this believe might be enforced, as this is a frequentist model I have to live the estimates. Clearly the plot could be improved with confidence intervals, hopefully showing quite some overlap for these estimates. Going Bayesian is clearly in the frame some when, as Gianluca showed.

Predictions

Previously it was attempted to predict the outcomes of one weekend of games of current season. The results and predictions given below.

Roda JC - Utrecht 0-1

PEC Zwolle - Groningen 1-2

RKC Waalwijk - VVV 1-1

Vitesse - Heracles 1-1

NEC - Willem II 0-0

ADO Den Haag - Ajax 1-1

Twente - Heerenveen 1-0

NAC Breda - AZ 2-1

PSV - Feyenoord 3-0

These are the original expectations:

club1 club2 win1 equal win2

1 Roda JC FC Utrecht 0.4580659 0.2126926 0.3291782

2 RKC Waalwijk VVV-Venlo 0.6076020 0.2180298 0.1743364

3 Vitesse Heracles Almelo 0.5723334 0.2275537 0.2000907

4 ADO Den Haag Ajax 0.1037534 0.1511710 0.7446496

5 FC Twente SC Heerenveen 0.6605607 0.1558135 0.1822923

6 NAC Breda AZ 0.2539698 0.2627759 0.4832506

7 PSV Feyenoord 0.5055082 0.2147899 0.2796468

The new model has different predictions; Most interesting FC Utrecht has now a slightly better chance to win than lose, while in the previous model it was predicted to lose. Big changes are in RKC Waalwijk-VVV Venlo and FC Twente-SC Heerenveen.

         club1           club2       win1     equal      win2

1      Roda JC      FC Utrecht 0.36966744 0.2506325 0.3797000

2 RKC Waalwijk       VVV-Venlo 0.78326557 0.1343054 0.0824290

3      Vitesse Heracles Almelo 0.69169975 0.1776653 0.1306350

4 ADO Den Haag            Ajax 0.09980196 0.1591822 0.7410159

5    FC Twente   SC Heerenveen 0.47168784 0.2140003 0.3143119

6    NAC Breda              AZ 0.29419688 0.2577802 0.4480229

7          PSV       Feyenoord 0.47611532 0.2253700 0.2985147

code for predictions:

topred <- read.table(textConnection("

'Roda JC' 'FC Utrecht'

'PEC Zwolle' 'FC Groningen'

'RKC Waalwijk' 'VVV-Venlo'

'Vitesse' 'Heracles Almelo'

'NEC' 'Willem II'

'ADO Den Haag' 'Ajax'

'FC Twente' 'SC Heerenveen'

'NAC Breda' 'AZ'

'PSV' 'Feyenoord'"

),col.names=c('Off','Def'))

morepred(clm4b,topred)

Additional code

fbpredict <- function(object,club1,club2) {

UseMethod('fbpredict',object)

}

fbpredict.polr <- function(object,club1,club2) {

top <- data.frame(OffenseClub=c(club1,club2),DefenseClub=c(club2,club1),OffThuis=c(1,0))

prepred <- predict(object,top,type='p')

oo <- outer(prepred[2,],prepred[1,])

rownames(oo) <- 0:(ncol(prepred)-1)

colnames(oo) <- rownames(oo)

class(oo) <- c('fboo',class(oo))

attr(oo,'row') <- club1

attr(oo,'col') <- club2

wel <- c(sum(oo[upper.tri(oo)]),sum(diag(oo)),sum(oo[lower.tri(oo)]))

names(wel) <- c(club1,'draw',club2)

return(list(details=oo,'summary chances'=wel))

}

fbpredict.clm <- function(object,club1,club2) {

top <- data.frame(OffenseClub=c(club1,club2),DefenseClub=c(club2,club1),OffThuis=c(1,0))

prepred <- predict(object,top,type='p')$fit

oo <- outer(prepred[2,],prepred[1,])

rownames(oo) <- 0:(ncol(prepred)-1)

colnames(oo) <- rownames(oo)

class(oo) <- c('fboo',class(oo))

attr(oo,'row') <- club1

attr(oo,'col') <- club2

wel <- c(sum(oo[upper.tri(oo)]),sum(diag(oo)),sum(oo[lower.tri(oo)]))

names(wel) <- c(club1,'draw',club2)

return(list(details=oo,'summary chances'=wel))

}

print.fboo <- function(x,...) {

cat(attr(x,'row'),'in rows against',attr(x,'col'),'in columns \n')

class(x) <- class(x)[-1]

attr(x,'row') <- NULL

attr(x,'col') <- NULL

oo <- formatC(x,format='f',width=4)

oo <- gsub('\\.0+$',' ',oo)

oo <- substr(oo,1,6)

print(oo,quote=FALSE,justify='left')

}

morepred <- function(mymodel,topred) {

UseMethod('morepred',mymodel)

}

morepred.polr <- function(mymodel,topred) {

topred <- topred[topred[,1] %in% mymodel$xlevels$OffenseClub &

topred[,2] %in% mymodel$xlevels$OffenseClub ,]

ap <- lapply(1:nrow(topred),function(irow) {

fbp <- fbpredict(mymodel,as.character(topred[irow,1]),

as.character(topred[irow,2]))

sec2 <- fbp[[2]]

mydf <- data.frame(club1=topred[irow,1],

club2=topred[irow,2],

win1=sec2[1],

equal=sec2[2],

win2=sec2[3])

})

dc <- do.call(rbind,ap)

rownames(dc) <- 1:nrow(dc)

}

morepred.sclm <- function(mymodel,topred) {

topred <- topred[topred[,1] %in% mymodel$xlevels$OffenseClub &

topred[,2] %in% mymodel$xlevels$OffenseClub ,]

ap <- lapply(1:nrow(topred),function(irow) {

fbp <- fbpredict(mymodel,as.character(topred[irow,1]),

as.character(topred[irow,2]))

sec2 <- fbp[[2]]

mydf <- data.frame(club1=topred[irow,1],

club2=topred[irow,2],

win1=sec2[1],

equal=sec2[2],

win2=sec2[3])

})

dc <- do.call(rbind,ap)

rownames(dc) <- 1:nrow(dc)

}

Sunday, September 30, 2012

Football, an ordinal model

On September 19^th, flo2speak remarked under a post that his/her experience is that ordinal models had better performance. That seems reason enough to try, so there we are. In examining this type of model it is found that more complex models can be used. There is now an interaction between defense powers and home/away and an effect of winter stop. This leads to a change in prediction values.

Ordinal model and Log linear regression

ordinal model

An ordinal model has as a response a series of ordered categories. An example is liking of a food product (very much dislike, dislike, neutral, like, like very much) or stars for an Amazon product. Essential is that these are a fixed number of ordered categories. For football, the goals could be categories, although this does not have the fixed number of categories. In practice, this is not much of a problem, higher number of goals hardly occur, and the highest category is just including more goals up to infinity.

An ordinal model describes the chances for each of the categories. One might imagine this model as giving a continuous response, plus a mapping how this response is translated into the categories. So, if category A has been given limits l₁ and l₂, then the probability that a response falls between l₁ and l₂, is the probability of category A.

One of the nice properties of the ordinal model is that the categories don't have to be the same size. It does not matter if the step from zero to one star is the same as the step from four to five stars. The model will adapt for that.

log linear model

The log linear model also predicts on a continuous scale. This scale is via the inverse-log (exponential) translated to a positive number. This number is then Poisson distributed. This means that a prediction has a confidence interval from the model, plus the variation due to the Poisson distribution. It should be noted that up to now my predictions were too precise, I only included the Poisson part. It also means that the predictions cannot become more 'narrow' for lack of a better word. It is not possible to get predictions which are saying there is 80% chance of one goal, because the Poisson distribution cannot make such outcomes. I can see that as a potential disadvantage.

Application

Ordinal model in R

There are at least three packages with functions which can be used to fit ordinal models. The most obvious one is MASS which has the function polr. Almost as obvious is the package ordinal, which is a bit more extensive than just using polr. Finally, package VGAM has most capabilities. In this post I will do one demonstration of polr, and for the remainder use clm from package ordinal. VGAM is not used.
To demonstrate the equivalence between polr and clm:
top <- data.frame(OffenseClub=c('FC Groningen','Vitesse')

StartData$oGoals <- ordered(StartData$Goals)

(pol1 <- polr(oGoals ~OffenseClub,data=StartData))
Call:
polr(formula = oGoals ~ OffenseClub, data = StartData)

Coefficients:
OffenseClubAjax OffenseClubAZ
2.16090602 1.10515167
OffenseClubDe Graafschap OffenseClubExcelsior
-0.07000361 -0.47817103
OffenseClubFC Groningen OffenseClubFC Twente
0.05437062 1.61304748
OffenseClubFC Utrecht OffenseClubFeyenoord
0.72672725 1.40465078
OffenseClubHeracles Almelo OffenseClubNAC Breda
0.62502257 0.38942582
OffenseClubNEC OffenseClubPSV
0.26579358 1.77961585
OffenseClubRKC Waalwijk OffenseClubRoda JC
0.07824836 0.63463829
OffenseClubSC Heerenveen OffenseClubVitesse
1.64087778 0.42797219
OffenseClubVVV-Venlo
0.14843577

Intercepts:
0|1 1|2 2|3 3|4 4|5 5|6 6|7
-0.5749746 0.8583642 1.9763217 2.9660666 4.1504225 4.9698549 6.2901004

Residual Deviance: 1929.761
AIC: 1977.761
(clm1 <- clm(oGoals ~OffenseClub ,data=StartData))
formula: oGoals ~ OffenseClub
data: StartData

link threshold nobs logLik AIC niter max.grad
logit flexible 612 -964.88 1977.76 6(0) 6.04e-13

Coefficients:
OffenseClubAjax OffenseClubAZ
2.16095 1.10517
OffenseClubDe Graafschap OffenseClubExcelsior
-0.07005 -0.47813
OffenseClubFC Groningen OffenseClubFC Twente
0.05446 1.61290
OffenseClubFC Utrecht OffenseClubFeyenoord
0.72677 1.40463
OffenseClubHeracles Almelo OffenseClubNAC Breda
0.62502 0.38933
OffenseClubNEC OffenseClubPSV
0.26583 1.77970
OffenseClubRKC Waalwijk OffenseClubRoda JC
0.07833 0.63467
OffenseClubSC Heerenveen OffenseClubVitesse
1.64085 0.42778
OffenseClubVVV-Venlo
0.14849

Threshold coefficients:
0|1 1|2 2|3 3|4 4|5 5|6 6|7
-0.5750 0.8584 1.9763 2.9661 4.1504 4.9699 6.2902

predict(pol1,top,type='p')
0 1 2 3 4 5 6
1 0.3476590 0.3431691 0.1815278 0.07606575 0.03521247 0.009087138 0.005324424
2 0.2683624 0.3376048 0.2187080 0.10209439 0.04962636 0.013063007 0.007703923
7
1 0.001954373
2 0.002837110

predict(clm1,top,type='p')
$fit
0 1 2 3 4 5 6
1 0.3476380 0.3431713 0.1815361 0.07607152 0.03521588 0.009087829 0.005325017
2 0.2684004 0.3376141 0.2186884 0.10207952 0.04961821 0.013060380 0.007702605
7
1 0.001954409
2 0.002836354

Selecting a start model

The feasible models are along the same lines as before. Offense, defense, home/away team, first/second half of the season. It is obvious that teams and hone/away are statistically significant.

clm0 <- clm(oGoals ~1,data=StartData)

clm2 <- clm(oGoals ~OffenseClub + DefenseClub,data=StartData)

clm3 <- clm(oGoals ~OffenseClub + DefenseClub +

OffThuis,data=StartData)

anova(clm0,clm1,clm2,clm3)

Likelihood ratio tests of cumulative link models:

formula: link: threshold:

clm0 oGoals ~ 1 logit flexible

clm1 oGoals ~ OffenseClub logit flexible

clm2 oGoals ~ OffenseClub + DefenseClub logit flexible

clm3 oGoals ~ OffenseClub + DefenseClub + OffThuis logit flexible

no.par AIC logLik LR.stat df Pr(>Chisq)

clm0 7 2038.1 -1012.03

clm1 24 1977.8 -964.88 94.305 17 9.984e-13 ***

clm2 41 1951.1 -934.54 60.677 17 8.122e-07 ***

clm3 42 1922.3 -919.14 30.806 1 2.851e-08 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So, how does this model clm3 compare with model3? Using a slightly redrafted function fbpredict (see bottom of post) the prediction Roda JC vs FC Utrecht the individual outcomes are a bit different, but the summary is close enough not to make much differences. This is not so strange, in many ways the same model is build.
fbpredict(clm3,"Roda JC","FC Utrecht")
$details
Roda JC in rows against FC Utrecht in columns
0 1 2 3 4 5 6 7
0 0.0173 0.0424 0.0477 0.0297 0.0155 0.0040 0.0024 0.0009
1 0.0349 0.0856 0.0964 0.0599 0.0313 0.0082 0.0048 0.0018
2 0.0303 0.0741 0.0835 0.0519 0.0271 0.0071 0.0042 0.0015
3 0.0153 0.0375 0.0423 0.0263 0.0137 0.0036 0.0021 0.0008
4 0.0072 0.0176 0.0198 0.0123 0.0064 0.0017 0.0010 0.0004
5 0.0018 0.0044 0.0049 0.0031 0.0016 0.0004 0.0002 0.0001
6 0.0010 0.0026 0.0029 0.0018 0.0009 0.0002 0.0001 0.0001
7 0.0004 0.0009 0.0010 0.0006 0.0003 0.0001 0.0001 0

$`summary chances`
Roda JC equal FC Utrecht
0.4603544 0.2196707 0.3199749

Selecting model extensions; interactions

It is also possible to extend the model. This shows that there is an interaction between defense capabilities and playing home or away. The statistical significance is about p=0.06, which is low enough to consider this model for prediction purposes.

clm4a <- clm(oGoals ~OffenseClub*OffThuis + DefenseClub
,data=StartData)
clm4b <- clm(oGoals ~OffenseClub + DefenseClub*OffThuis
,data=StartData)
clm5 <- clm(oGoals ~(OffenseClub + DefenseClub)*OffThuis
,data=StartData)
anova (clm3,clm4a,clm5)
Likelihood ratio tests of cumulative link models:

formula: link: threshold:
clm3 oGoals ~ OffenseClub + DefenseClub + OffThuis logit flexible
clm4a oGoals ~ OffenseClub * OffThuis + DefenseClub logit flexible
clm5 oGoals ~ (OffenseClub + DefenseClub) * OffThuis logit flexible

no.par AIC logLik LR.stat df Pr(>Chisq)
clm3 42 1922.3 -919.14
clm4a 59 1938.2 -910.08 18.115 17 0.38164
clm5 76 1945.6 -896.81 26.552 17 0.06497 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova (clm3,clm4b,clm5)
Likelihood ratio tests of cumulative link models:

formula: link: threshold:
clm3 oGoals ~ OffenseClub + DefenseClub + OffThuis logit flexible
clm4b oGoals ~ OffenseClub + DefenseClub * OffThuis logit flexible
clm5 oGoals ~ (OffenseClub + DefenseClub) * OffThuis logit flexible

no.par AIC logLik LR.stat df Pr(>Chisq)
clm3 42 1922.3 -919.14
clm4b 59 1930.7 -906.36 25.560 17 0.08286 .
clm5 76 1945.6 -896.81 19.107 17 0.32242
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The implication is that model 'clm4b' is now considered an alternative for model 'clm3'. This makes for an interesting change in predictions. It can be seen that with this model FC Utrecht has a better chance to win than with 'clm3'.

fbpredict(clm4b,"Roda JC","FC Utrecht")

$details

Roda JC in rows against FC Utrecht in columns

0 1 2 3 4 5 6 7

0 0.0379 0.0689 0.0503 0.0220 0.0093 0.0022 0.0012 0.0004

1 0.0701 0.1274 0.0931 0.0406 0.0172 0.0040 0.0023 0.0008

2 0.0523 0.0950 0.0694 0.0303 0.0128 0.0030 0.0017 0.0006

3 0.0231 0.0420 0.0307 0.0134 0.0057 0.0013 0.0007 0.0003

4 0.0098 0.0179 0.0131 0.0057 0.0024 0.0006 0.0003 0.0001

5 0.0023 0.0041 0.0030 0.0013 0.0006 0.0001 0.0001 0

6 0.0013 0.0024 0.0017 0.0008 0.0003 0.0001 0 0

7 0.0005 0.0008 0.0006 0.0003 0.0001 0 0 0

$`summary chances`

Roda JC equal FC Utrecht

0.3696674 0.2506325 0.3797000

Winter break

On top of that, the effect of before/after winter break can be examined. It would seem that winter break does have an effect. There is not much point in trying to make predictions using the model with winter break. If winter break has an effect, so does summer break. This does however, fit well with the remark of Huub on a previous post 'I tried to predict the results using only the data from the present year, and got four correct (FC Utrecht, FC Groningen, FC Twente, PSV). Maybe it is an idea to combine both the data of last year and of the current season and give a weight to the current year of, for example, 2. In this way there is still enough information in the model but it also accounts for more recent results...'. Difference between years is a reason for more recent data to perform better.

StartData$year <- factor(c(substr(old$Datum,1,4),substr(old$Datum,1,4)))

clm6 <- clm(oGoals ~OffenseClub + DefenseClub + year + OffThuis

,data=StartData)

clm7 <- clm(oGoals ~(OffenseClub + DefenseClub)*year + OffThuis

,data=StartData)

anova (clm3,clm6,clm7)

Likelihood ratio tests of cumulative link models:

formula: link: threshold:

clm3 oGoals ~ OffenseClub + DefenseClub + OffThuis logit flexible

clm6 oGoals ~ OffenseClub + DefenseClub + year + OffThuis logit flexible

clm7 oGoals ~ (OffenseClub + DefenseClub) * year + OffThuis logit flexible

no.par AIC logLik LR.stat df Pr(>Chisq)

clm3 42 1922.3 -919.14

clm6 43 1924.2 -919.12 0.0341 1 0.85356

clm7 77 1945.7 -895.84 46.5557 34 0.07407 .

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(clm3,clm7)

Likelihood ratio tests of cumulative link models:

formula: link: threshold:

clm3 oGoals ~ OffenseClub + DefenseClub + OffThuis logit flexible

clm7 oGoals ~ (OffenseClub + DefenseClub) * year + OffThuis logit flexible

no.par AIC logLik LR.stat df Pr(>Chisq)

clm3 42 1922.3 -919.14

clm7 77 1945.7 -895.84 46.59 35 0.09106 .

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Even more complex model

A final step is to combine the good of model 'clm4b' and 'clm7'. It would seem these effects can very well be combined. This means that at some point a year effect must be included.

clmX <- clm(oGoals ~(OffenseClub + DefenseClub)*year + DefenseClub*OffThuis
,data=StartData)
anova(clm3,clmX)
Likelihood ratio tests of cumulative link models:

formula: link:
clm3 oGoals ~ OffenseClub + DefenseClub + OffThuis logit
clmX oGoals ~ (OffenseClub + DefenseClub) * year + DefenseClub * OffThuis logit
threshold:
clm3 flexible
clmX flexible

no.par AIC logLik LR.stat df Pr(>Chisq)
clm3 42 1922.3 -919.14
clmX 94 1951.3 -881.67 74.942 52 0.0203 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(clm4b,clmX)
Likelihood ratio tests of cumulative link models:

formula:
clm4b oGoals ~ OffenseClub + DefenseClub * OffThuis
clmX oGoals ~ (OffenseClub + DefenseClub) * year + DefenseClub * OffThuis
link: threshold:
clm4b logit flexible
clmX logit flexible

no.par AIC logLik LR.stat df Pr(>Chisq)
clm4b 59 1930.7 -906.36
clmX 94 1951.3 -881.67 49.383 35 0.05424 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(clm7,clmX)
Likelihood ratio tests of cumulative link models:

formula: link:
clm7 oGoals ~ (OffenseClub + DefenseClub) * year + OffThuis logit
clmX oGoals ~ (OffenseClub + DefenseClub) * year + DefenseClub * OffThuis logit
threshold:
clm7 flexible
clmX flexible

no.par AIC logLik LR.stat df Pr(>Chisq)
clm7 77 1945.7 -895.84
clmX 94 1951.3 -881.67 28.353 17 0.04098 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Additional R code

fbpredict <- function(object,club1,club2) {

UseMethod('fbpredict',object)

}

fbpredict.polr <- function(object,club1,club2) {

top <- data.frame(OffenseClub=c(club1,club2),DefenseClub=c(club2,club1),OffThuis=c(1,0))

prepred <- predict(object,top,type='p')

oo <- outer(prepred[2,],prepred[1,])

rownames(oo) <- 0:(ncol(prepred)-1)

colnames(oo) <- rownames(oo)

class(oo) <- c('fboo',class(oo))

attr(oo,'row') <- club1

attr(oo,'col') <- club2

wel <- c(sum(oo[upper.tri(oo)]),sum(diag(oo)),sum(oo[lower.tri(oo)]))

names(wel) <- c(club1,'equal',club2)

return(list(details=oo,'summary chances'=wel))

}

fbpredict.clm <- function(object,club1,club2) {

top <- data.frame(OffenseClub=c(club1,club2),DefenseClub=c(club2,club1),OffThuis=c(1,0))

prepred <- predict(object,top,type='p')$fit

oo <- outer(prepred[2,],prepred[1,])

rownames(oo) <- 0:(ncol(prepred)-1)

colnames(oo) <- rownames(oo)

class(oo) <- c('fboo',class(oo))

attr(oo,'row') <- club1

attr(oo,'col') <- club2

wel <- c(sum(oo[upper.tri(oo)]),sum(diag(oo)),sum(oo[lower.tri(oo)]))

names(wel) <- c(club1,'equal',club2)

return(list(details=oo,'summary chances'=wel))

}

print.fboo <- function(x,...) {

cat(attr(x,'row'),'in rows against',attr(x,'col'),'in columns \n')

class(x) <- class(x)[-1]

attr(x,'row') <- NULL

attr(x,'col') <- NULL

oo <- formatC(x,format='f',width=4)

oo <- gsub('\\.0+$',' ',oo)

oo <- substr(oo,1,6)

print(oo,quote=FALSE,justify='left')

}