## Sunday, September 16, 2012

### Football model

After reading Dutch football data (Eredivisie 2011-2012) and making a predictions display, it is time to look at a few simple models to predict goals. To reiterate the data setup: each game played consists of two rows in the data frame, one row for the number of goals scored by the home team and another row for the away team. We start with four models. Two are models I don't believe in: model 0, where the number of goals is independent of the clubs and everything else, and model 1, where the number of goals depends only on the team scoring them. The other two are plausible: model 2, where both the attacking and the defending team determine the number of goals, and model 3, where both teams determine the number of goals, together with who is playing at home.
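
As an illustration of this layout, the sketch below builds the two rows for a single match. The column names follow the model formulas used in this post; the score is invented, and coding `OffThuis` as 0/1 is an assumption about how the real `StartData` is stored.

```r
# One hypothetical match (invented score): Ajax 3 - 1 Vitesse, Ajax at home.
# Column names follow the model formulas; the 0/1 coding of OffThuis is assumed.
one_game <- data.frame(
  Goals       = c(3, 1),
  OffenseClub = c('Ajax', 'Vitesse'),   # team scoring the goals
  DefenseClub = c('Vitesse', 'Ajax'),   # team conceding the goals
  OffThuis    = c(1, 0)                 # 1 if the scoring team plays at home
)
print(one_game)
```

With 18 clubs playing 306 matches, this layout gives 612 rows, which agrees with the 611 residual degrees of freedom reported for the intercept-only model.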

# Four nested Poisson regressions, from no predictors to both clubs plus a home effect
model0 <- glm(Goals ~ 1, data=StartData, family='poisson')
model1 <- glm(Goals ~ OffenseClub, data=StartData, family='poisson')
model2 <- glm(Goals ~ OffenseClub + DefenseClub, data=StartData, family='poisson')
model3 <- glm(Goals ~ OffenseClub + DefenseClub + OffThuis,
  data=StartData, family='poisson')
# Sequential chi-squared tests between successive models
anova(model0, model1, model2, model3, test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ 1
Model 2: Goals ~ OffenseClub
Model 3: Goals ~ OffenseClub + DefenseClub
Model 4: Goals ~ OffenseClub + DefenseClub + OffThuis
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1       611     865.23
2       594     754.17 17  111.064 7.610e-16 ***
3       577     699.13 17   55.043 6.743e-06 ***
4       576     668.96  1   30.172 3.955e-08 ***
It appears that each modelling step which makes the model more complex is significant; we must reject the hypothesis that any of these terms is not relevant. Hence the number of goals depends on both teams plus a home team effect.
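
To make the test behind the last row of the table concrete, the sketch below recomputes its p-value by hand. The deviances and degrees of freedom are copied from the output above; treating the drop in deviance as chi-squared distributed is the approximation that `test='Chisq'` relies on.

```r
# Residual deviances and degrees of freedom copied from the table above
dev_drop <- 699.13 - 668.96   # deviance explained by adding OffThuis
df_drop  <- 577 - 576         # one extra parameter
# Upper-tail chi-squared probability, the quantity reported as Pr(>Chi)
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
print(p_value)
```

Because the deviances in the table are rounded, the result should agree with the printed value only to a few significant digits.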

### The twelfth man

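It does make a difference who is playing at home. In practical terms, due to the model used, this advantage is difficult to interpret. In general, when two clubs of baseline strength play each other (the intercept corresponds to the reference club, Ado Den Haag, in both the attacking and the defending role), they each make about 1.3 goals.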
exp(coef(model2)[1])
(Intercept)
1.346538
When one of these equally strong teams plays away and the other at home, the numbers change: a team playing at home makes 1.6 goals, while a team playing away makes only 1.1.
exp(coef(model3)[length(coef(model3))] + coef(model3)[1])
OffThuis
1.58019
exp(coef(model3)[1])
(Intercept)
1.112886
This would make playing away or at home both statistically and practically significant. Note that the size of this effect cannot be transferred to other circumstances.
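
Because the Poisson model uses a log link, the home effect is additive on the log scale and therefore multiplicative on the goal scale. The sketch below works this out from the two numbers printed above; it is only arithmetic on that output, not a refit of the model.

```r
# Expected goals copied from the output above
home_goals <- 1.58019    # exp(intercept + OffThuis coefficient)
away_goals <- 1.112886   # exp(intercept)
# The ratio equals exp(OffThuis coefficient): the home multiplier
home_multiplier <- home_goals / away_goals
print(round(home_multiplier, 2))
# The absolute difference, by contrast, holds for this baseline pairing only
print(round(home_goals - away_goals, 2))
```

Under model3, which has no interaction terms, the multiplier of about 1.42 is the same for every pairing, whereas the difference of roughly 0.47 goals changes with the strength of the two clubs.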

### The teams

Each of the teams has two parameters in the model. These are most easily interpreted as offensive and defensive power. The following code plots these powers.
co <- coef(model3)
coO <- co[grep('Offense',names(co))]
coD <- co[grep('Defense',names(co))]
names(coO) <- gsub('OffenseClub','',names(coO))
names(coD) <- gsub('DefenseClub','',names(coD))
# Ado Den Haag is the reference level and so is missing from the parameterization; it is added with value 0.
coB <- rbind(cbind(coO,coD),matrix(c(0,0)
,nrow=1,ncol=2,dimnames=list('Ado Den Haag',c('coO','coD'))))
# centred (not rescaled) so that strengths are relative to the league average
coB <- as.data.frame(scale(coB,scale=FALSE))
# -coD to make more defensive power visually larger
plot(-coD ~coO, type='n', data=coB,xlab='Offensive power',ylab='Defensive power',axes=FALSE)
text(-coD ~coO,data=coB,labels=rownames(coB))
abline(a=0,b=1)
abline(v=0)
abline(h=0)
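
The centring step in the code above can be illustrated on a small example. The coefficients below are invented for three hypothetical clubs; only the use of `scale(..., scale=FALSE)` mirrors the code in this post.

```r
# Hypothetical log-scale coefficients; the reference club has value 0 by construction
co_demo <- cbind(coO = c(0.0, 0.4, -0.2),
                 coD = c(0.0, -0.3, 0.3))
rownames(co_demo) <- c('Reference', 'Club A', 'Club B')
# scale(..., scale=FALSE) subtracts each column mean without dividing by the sd
centred <- as.data.frame(scale(co_demo, scale = FALSE))
print(centred)
# After centring, both columns average to zero (up to rounding error)
print(colMeans(centred))
```

After centring, zero on each axis means league-average strength rather than the strength of the reference club.
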
The plot shows the axes; a team close to the centre (NAC Breda, FC Utrecht) was average in both offensive and defensive strength. The diagonal line depicts the region of equal defensive and offensive strength. Hence Feyenoord is equally strong in offense and defense, and the same holds for De Graafschap. The line does not run exactly from corner to corner, because the range in offensive strength is larger than the range in defensive strength. The best team is top right: Ajax. The worst teams are bottom left: De Graafschap and Excelsior, which have been relegated to the Eerste Divisie. A few clubs are noticeable for their mismatch in offensive and defensive strengths. SC Heerenveen has almost the same goal-scoring power as Ajax, but not enough defensive capacity. In contrast, Vitesse does not concede many goals, but lacks the power to score them. Overall they have about the same strength.
Stated otherwise: if SC Heerenveen played against itself, ignoring home team advantage, it would probably make two or even three goals.
fbpredict(model2,'SC Heerenveen','SC Heerenveen')[[1]]
SC Heerenveen in rows against SC Heerenveen in columns
  0      1      2      3      4      5      6      7      8      9
0 0.0060 0.0153 0.0196 0.0167 0.0107 0.0055 0.0023 0.0009 0.0003 0.0001
1 0.0153 0.0391 0.0501 0.0428 0.0274 0.0140 0.0060 0.0022 0.0007 0.0002
2 0.0196 0.0501 0.0641 0.0548 0.0351 0.0180 0.0077 0.0028 0.0009 0.0003
3 0.0167 0.0428 0.0548 0.0467 0.0299 0.0153 0.0065 0.0024 0.0008 0.0002
4 0.0107 0.0274 0.0351 0.0299 0.0192 0.0098 0.0042 0.0015 0.0005 0.0001
5 0.0055 0.0140 0.0180 0.0153 0.0098 0.0050 0.0021 0.0008 0.0003 0.0001
6 0.0023 0.0060 0.0077 0.0065 0.0042 0.0021 0.0009 0.0003 0.0001 0
7 0.0009 0.0022 0.0028 0.0024 0.0015 0.0008 0.0003 0.0001 0      0
8 0.0003 0.0007 0.0009 0.0008 0.0005 0.0003 0.0001 0      0      0
9 0.0001 0.0002 0.0003 0.0002 0.0001 0.0001 0      0      0      0
If Vitesse played against itself it would make zero or one goal.
fbpredict(model2,'Vitesse','Vitesse')[[1]]
Vitesse in rows against Vitesse in columns
  0      1      2      3      4      5      6      7      8      9
0 0.1165 0.1252 0.0673 0.0241 0.0065 0.0014 0.0002 0      0      0
1 0.1252 0.1346 0.0724 0.0259 0.0070 0.0015 0.0003 0      0      0
2 0.0673 0.0724 0.0389 0.0139 0.0037 0.0008 0.0001 0      0      0
3 0.0241 0.0259 0.0139 0.0050 0.0013 0.0003 0.0001 0      0      0
4 0.0065 0.0070 0.0037 0.0013 0.0004 0.0001 0      0      0      0
5 0.0014 0.0015 0.0008 0.0003 0.0001 0      0      0      0      0
6 0.0002 0.0003 0.0001 0.0001 0      0      0      0      0      0
7 0      0      0      0      0      0      0      0      0      0
8 0      0      0      0      0      0      0      0      0      0
9 0      0      0      0      0      0      0      0      0      0
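
The tables above look like what one gets when the two sides score as independent Poisson variables. The sketch below shows that construction; `score_table` is a hypothetical helper written for this illustration, not the `fbpredict` function used in this post, and the rate of 1.075 is back-calculated from the 0-0 cell of the Vitesse table (exp(-2 * lambda) = 0.1165).

```r
# Hypothetical helper, not the author's fbpredict(): joint score probabilities
# assuming both teams score as independent Poisson variables.
score_table <- function(lambda_rows, lambda_cols, max_goals = 9) {
  goals <- 0:max_goals
  # Outer product of the two marginal Poisson probability vectors
  tab <- outer(dpois(goals, lambda_rows), dpois(goals, lambda_cols))
  dimnames(tab) <- list(as.character(goals), as.character(goals))
  tab
}
# Rate back-calculated from the Vitesse table above (an approximation)
tab <- score_table(1.075, 1.075)
print(round(tab[1:3, 1:3], 4))
# Most likely number of goals for the row team: mode of its marginal distribution
most_likely <- unname(which.max(rowSums(tab))) - 1
print(most_likely)
```

If this assumption holds, the whole table is determined by two expected goal counts, one per side, which is exactly what the fitted model supplies.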

### Model extensions

The residual deviance of model3 is 668.96 on 576 degrees of freedom, somewhat more than one unit of deviance per degree of freedom. That might mean some more effects can be found in the data.
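
A rough way to quantify this is the ratio of residual deviance to residual degrees of freedom, sketched below with the numbers quoted above. The chi-squared comparison is only a rule of thumb here, since the approximation tends to be poor when the counts are small, as goal counts are.

```r
# Values quoted above for model3
resid_dev <- 668.96
resid_df  <- 576
# A ratio close to 1 is what a well-fitting Poisson model would give
dispersion <- resid_dev / resid_df
print(round(dispersion, 2))
# Rule-of-thumb goodness-of-fit check against a chi-squared distribution
gof_p <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)
print(gof_p)
```

A ratio of about 1.16 points to mild extra variation at most, which fits with the extensions below failing to reach significance.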

#### Twelfth man and teams

The first extension is that the home and away advantage differs between teams, either in attack (model4a), in defense (model4b) or in both (model5). Based on these data, this does not seem to be statistically significant.
model4a <- glm(Goals ~OffenseClub*OffThuis + DefenseClub
,data=StartData,family='poisson')
model4b <- glm(Goals ~OffenseClub + DefenseClub*OffThuis
,data=StartData,family='poisson')
model5 <- glm(Goals ~(OffenseClub + DefenseClub)*OffThuis
,data=StartData,family='poisson')
anova (model3,model4a,model5,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ OffenseClub + DefenseClub + OffThuis
Model 2: Goals ~ OffenseClub * OffThuis + DefenseClub
Model 3: Goals ~ (OffenseClub + DefenseClub) * OffThuis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       576     668.96
2       559     649.00 17   19.953   0.2766
3       542     626.77 17   22.236   0.1758
anova (model3,model4b,model5,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ OffenseClub + DefenseClub + OffThuis
Model 2: Goals ~ OffenseClub + DefenseClub * OffThuis
Model 3: Goals ~ (OffenseClub + DefenseClub) * OffThuis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       576     668.96
2       559     647.46 17   21.499   0.2048
3       542     626.77 17   20.690   0.2404

#### Before and after winter break

The winter break gives clubs the opportunity to change players, so teams might change in quality during this period. In these data, this effect does not seem to be statistically significant.
StartData$year <- factor(c(substr(old$Datum,1,4),substr(old$Datum,1,4)))
model6 <- glm(Goals ~OffenseClub + DefenseClub  + year + OffThuis
,data=StartData,family='poisson')
model7 <- glm(Goals ~(OffenseClub + DefenseClub)*year + OffThuis
,data=StartData,family='poisson')
anova (model3,model6,model7,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ OffenseClub + DefenseClub + OffThuis
Model 2: Goals ~ OffenseClub + DefenseClub + year + OffThuis
Model 3: Goals ~ (OffenseClub + DefenseClub) * year + OffThuis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       576     668.96
2       575     668.82  1    0.135   0.7129
3       541     625.48 34   43.345   0.1308
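
The `year` factor used above serves as a proxy for before and after the winter break. The sketch below shows the construction on invented dates; it assumes that `Datum` is a string starting with the four-digit year and that the data frame stacks all home rows before all away rows, as the doubling in the code above suggests.

```r
# Hypothetical date strings; the real format of old$Datum is assumed to start with the year
datum_demo <- c('2011-08-05', '2011-12-18', '2012-01-20', '2012-05-06')
# Each match contributes two rows (home and away), hence the repetition
year_demo <- factor(rep(substr(datum_demo, 1, 4), times = 2))
print(table(year_demo))
```

Matches in 2011 fall before the break and matches in 2012 after it, so the `year` main effect tests an overall shift in scoring and the interactions test club-specific changes.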

#### 5 comments:

1. Could we do the same with individual players?

1. That would be difficult if not impossible. The smaller problem is getting data: who played in which matches? The bigger problem is the amount of data per player. Some players may have played only a few games, hence very imprecise results. Other players may have played a lot together, so who is causing their combined effect? It is certainly outside the scope I envisioned.

2. i really like your post. did you include the goals of the current matchday (the goals you want to explain) in your independent variables or did you exclude them?

1. At this point I have included all data of season 2011/2012. I am looking at understanding the data, what explains the number of goals? The predictions shown I would not call predictions, rather illustrations of the model and output.
Obviously it is also interesting to add the current season and predict coming matches. That is something I still need to build.

2. i tried it in the German context. the models are a lot poorer. ordinal regression worked better than Poisson models. models on the game level also worked better than models on the gameXteam level. i can predict about 30 % of the results (not the goals, just the direction) which is not that good. bookies are better ;-)