Saturday, November 24, 2012

Secret Santa - unfinished business

Last week I wrote:
This is actually a more difficult calculation (or I forgot too much probability). Luckily a bit of brute force comes in handy. To reiterate, in general simulated data shows 0.54 redraws because of the first person etc.
colSums(countstop)/nrep
[1] 0.54276 0.40735 0.31383 0.24834 0.20415
In this last week I figured out how to finish this more elegantly/without too much brute force.
To reiterate, the chance within a particular round of drawing names are:
(p.onedraw <- c(colSums(redraw)/nrow(redraw),p.succes) )
[1] 0.200 0.150 0.117 0.0917 0.075 0.367
This means that a 'round' of secret santa with 5 persons can have 6 outcomes: The first person draws self, the second draws self, same for person three , four and five. Sixth outcome: a valid drawing of names. If a person self draws, a next round occurs. This means an infinite number of rounds and drawings may occur. The brute force solution was to follow a lot of solutions and do this again and again. But, we can calculate the probability of of a finish after twice a fail at person 1 and three times fail at person 3. It is the chance of this happening: 0.22*0.1173*0.367 times the number of ways this can happen: (2+3)!/(2!*3!). Obviously, this is a generalization of some of the formulas for binomial. 
Now, to do this for all reasonable possibilities of results. The latter may be obtained e.g. by 
ncalc <- 8 xx <- expand.grid(0:ncalc,0:ncalc,0:ncalc,0:ncalc,0:ncalc)
nrow(xx)
[1] 59049

Implementation phase 1

There are quite a number of probabilities to calculate, and as this will be wholesale, every shortcut is welcome. The main approach is to do the whole calculation on a logarithmic scale and transform back at the end. 
With a bit of preparation log(factorial) is just getting a number out of a vector.
lnumb <- c(0,log(1:40))
clnumb <- cumsum(lnumb)
lfac <- function(x) clnumb[x+1]
exp(lfac(4))
[1] 24
to store logs for chances: 
lp1d <- log(p.onedraw)
Combine it all into functions
lpoccur <- function(x,lpvals) {
  x <- as.numeric(x)
  crossprod(c(x,1),lpvals) + lfac(sum(x))-sum(sapply(x,lfac))
}
poccur <- function(x,lpvals) exp(lpoccur(x,lpvals))
And call it over the matrix xx
chances <- sapply(1:nrow(xx),function(x) poccur(xx[x,],lp1d))
newcalc.res <- as.data.frame(cbind(xx,chances))
newcalc.res <- newcalc.res[order(-newcalc.res$chances),]
head(newcalc.res)
     Var1 Var2 Var3 Var4 Var5    chances
1       0    0    0    0    0 0.36666667
2       1    0    0    0    0 0.07333333
10      0    1    0    0    0 0.05500000
82      0    0    1    0    0 0.04277778
730     0    0    0    1    0 0.03361111
6562    0    0    0    0    1 0.02750000

Implementation phase 2

The idea was to not store matrix xx, but rather generate the rows on the fly, because I thought xx would be too large. But is seems that is not needed and a suitable xx fits into memory. 
sum(chances)
[1] 0.9998979
Hence the calculation can be vectorised a bit more, reducing calculation time again:
ntrial <- rowSums(xx)
accu <- lfac(ntrial) + crossprod(t(as.matrix(xx)),lp1d[1:(length(lp1d)-1)])
for (i in 1:ncol(xx) ) accu <- accu-lfac(xx[,i])
accu <- accu+lp1d[length(lp1d)]
chance2 <- exp(accu)
newcalc2.res <- cbind(xx,chance2)
newcalc2.res <- newcalc2.res[order(-newcalc2.res$chance2),]
So, the values from simulation are recreated
sum(newcalc2.res$Var1*newcalc2.res$chance2)
[1] 0.5445788
sum(newcalc2.res$Var5*newcalc2.res$chance2)
[1] 0.2044005
Calculation done! In the end it was not so much a difficult or long calculation. However, a simulation is much easy to construct and clearly pointed out directions.

Sunday, November 18, 2012

Secret Santa - again

Based on comments by cellocgw I decided to look at last week's Secret Santa again. This time, the moment a person, whoever that is, draws his/her own name, the drawing starts again at the first person.

Introduction

A group of n persons draws sequentially names for Secret Santa. Each person may not draw his/her own name. If a person draws his/her own name then all names are returned and the drawing starts again. Questions are such as: How often do you draw names?

Simulation 

To understand the problem a simulation is build. The simulation draws the names, with restarts if needed. These are placed into outcome. On top of that it counts where a person self-draws.
selsan1 <- function(persons) {
  startoutcome <- rep(0,persons)
  countstop <- rep(0,persons)
  outcome <- startoutcome
  done <- FALSE
  possible <- 1:persons
  who <- 1           
  while(!done) {
    remaining <- possible[!(possible %in% outcome)]
    if (length(remaining)==1) {
      if (who==remaining) {
        countstop[who] <- countstop[who]+1
        outcome <- startoutcome
        who <- 1
      } else {
        outcome[who] <- remaining
        who <- who+1
        done <- TRUE
      }
    }     else {
      select <- sample(possible[!(possible %in% outcome)],1)
      if (who==select) {
        countstop[who] <- countstop[who]+1
        outcome <- startoutcome
        who <- 1
      } else {
        outcome[who] <- select
        who <- who+1
      }
    }
  }     
  return(list(outcome=outcome,countstop=countstop))
}
In an unlucky draw for five persons there were four restarts, at person 1, 2, 3 and 5.
selsan1(5)
$outcome
[1] 2 1 5 3 4

$countstop
[1] 1 1 1 0 1
The aim of the simulation is to do this a lot of times and draw conclusions from there. 
nrep=1e5
simulations <- lapply(1:nrep,function(x) selsan1(5))
The first question concerns the outcomes. There are 44 allowed results. They are obtained about equally often.
outcomes <- sapply(simulations,function(x) paste(x$outcome,collapse=' '))
as.data.frame(table(outcomes))
    outcomes Freq
1  2 1 4 5 3 2301
2  2 1 5 3 4 2263
3  2 3 1 5 4 2192
4  2 3 4 5 1 2323
5  2 3 5 1 4 2268
6  2 4 1 5 3 2230
7  2 4 5 1 3 2280
8  2 4 5 3 1 2242
9  2 5 1 3 4 2203
10 2 5 4 1 3 2247
11 2 5 4 3 1 2327
12 3 1 2 5 4 2321
13 3 1 4 5 2 2243
14 3 1 5 2 4 2183
15 3 4 1 5 2 2346
16 3 4 2 5 1 2264
17 3 4 5 1 2 2216
18 3 4 5 2 1 2269
19 3 5 1 2 4 2301
20 3 5 2 1 4 2359
21 3 5 4 1 2 2301
22 3 5 4 2 1 2206
23 4 1 2 5 3 2291
24 4 1 5 2 3 2249
25 4 1 5 3 2 2263
26 4 3 1 5 2 2321
27 4 3 2 5 1 2265
28 4 3 5 1 2 2318
29 4 3 5 2 1 2329
30 4 5 1 2 3 2220
31 4 5 1 3 2 2309
32 4 5 2 1 3 2228
33 4 5 2 3 1 2291
34 5 1 2 3 4 2284
35 5 1 4 2 3 2314
36 5 1 4 3 2 2304
37 5 3 1 2 4 2339
38 5 3 2 1 4 2213
39 5 3 4 1 2 2285
40 5 3 4 2 1 2194
41 5 4 1 2 3 2255
42 5 4 1 3 2 2336
43 5 4 2 1 3 2218
44 5 4 2 3 1 2289
The number of times a redraw is taken can also be obtained. 
countstop <- t(sapply(simulations,function(x) x$countstop))
table(rowSums(countstop))/nrep
      0       1       2       3       4       5       6       7       8       9 
0.36938 0.23036 0.14788 0.09337 0.05929 0.03673 0.02220 0.01440 0.00988 0.00603 
     10      11      12      13      14      15      16      17      18      19 
0.00396 0.00240 0.00168 0.00089 0.00053 0.00041 0.00021 0.00014 0.00012 0.00005 
     20      21      22      28 
0.00004 0.00003 0.00001 0.00001 
We can also extract where the redraws occur. In general there is 0.5 redraw because of the first person, 0.4 because of the second, etc. The numbers do not add to 1, they are not chances. 
colSums(countstop)/nrep
[1] 0.54276 0.40735 0.31383 0.24834 0.20415

Calculations

The question is, can we achieve the same with a calculation. For this we obtain the chances of various results. For this we build three matrices. All permutations in pp. Continuation of a sequence is in permitted. Finally, redraw contains the person where a person causes a new draw. Trick here is that if person 2 causes a redraw, then no subsequent persons cause a redraw, hence only 2 is marked in redraw.
pp <- randtoolbox::permut(nn)
redraw <- matrix(FALSE,nrow(pp),nn)
permitted <- redraw
redraw[,1] <- pp[,1] ==1
permitted[,1] <- pp[,1]!=1

for(i in 2:nn) {
  permitted[,i] <- pp[,i]!=i & permitted[,i-1]
  redraw[,i] <- pp[,i] == i & permitted[,i-1]
}
The sequences start like this.
head(pp)
     i i i    
[1,] 5 4 3 1 2
[2,] 5 4 3 2 1
[3,] 5 4 1 3 2
[4,] 5 4 2 3 1
[5,] 5 4 1 2 3
[6,] 5 4 2 1 3
head(permitted)
     [,1] [,2]  [,3]  [,4]  [,5]
[1,] TRUE TRUE FALSE FALSE FALSE
[2,] TRUE TRUE FALSE FALSE FALSE
[3,] TRUE TRUE  TRUE  TRUE  TRUE
[4,] TRUE TRUE  TRUE  TRUE  TRUE
[5,] TRUE TRUE  TRUE  TRUE  TRUE
[6,] TRUE TRUE  TRUE  TRUE  TRUE
head(redraw)
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] FALSE FALSE  TRUE FALSE FALSE
[2,] FALSE FALSE  TRUE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE

Chance of succes

The chance of a success drawing is the mean of the last column in permitted. Below a comparison with the simulation result. First the observed proportions.
byrow <- as.data.frame(table(rowSums(countstop))/nrep)
head(byrow)
  Var1    Freq
1    0 0.36938
2    1 0.23036
3    2 0.14788
4    3 0.09337
5    4 0.05929
6    5 0.03673
Now the matching calculation. The numbers can be calculated easily.
(p.succes <- mean(permitted[,nn]))
[1] 0.3666667
byrow$n <- as.numeric(levels(byrow$Var1)[byrow$Var1])
byrow$p <- sapply(byrow$n,function(x) p.succes*(1-p.succes)^x)
byrow[,c(1,3,4,2)]
   Var1  n            p    Freq
1     0  0 3.666667e-01 0.36938
2     1  1 2.322222e-01 0.23036
3     2  2 1.470741e-01 0.14788
4     3  3 9.314691e-02 0.09337
5     4  4 5.899305e-02 0.05929
6     5  5 3.736226e-02 0.03673
7     6  6 2.366277e-02 0.02220
8     7  7 1.498642e-02 0.01440
9     8  8 9.491398e-03 0.00988
10    9  9 6.011219e-03 0.00603
11   10 10 3.807105e-03 0.00396
12   11 11 2.411167e-03 0.00240
13   12 12 1.527072e-03 0.00168
14   13 13 9.671458e-04 0.00089
15   14 14 6.125256e-04 0.00053
16   15 15 3.879329e-04 0.00041
17   16 16 2.456908e-04 0.00021
18   17 17 1.556042e-04 0.00014
19   18 18 9.854933e-05 0.00012
20   19 19 6.241457e-05 0.00005
21   20 20 3.952923e-05 0.00004
22   21 21 2.503518e-05 0.00003
23   22 22 1.585561e-05 0.00001
24   28 28 1.023239e-06 0.00001

Where fall the redraws

This is actually a more difficult calculation (or I forgot too much probability). Luckily a bit of brute force comes in handy. To reiterate, in general simulated data shows 0.54 redraws because of the first person etc.
colSums(countstop)/nrep
[1] 0.54276 0.40735 0.31383 0.24834 0.20415
So, what happens in a drawing? The outcomes follow from the matrix redraw. There is 20% chance the first person draws 1, 25% chance the second person draws a 2 etc. Finally, as established, the chance is 36% to have a good draw.
(p.onedraw <- c(colSums(redraw)/nrow(redraw),p.succes) )
[1] 0.20000000 0.15000000 0.11666667 0.09166667 0.07500000 0.36666667
The function below takes these numbers and a locator vector to return a data frame with chances, location of fail and success status in column finish
one.draw <- function(status.now,p.now,p.onedraw) {
  la <- lapply(1:(nn+1),function(x) {
        status <- status.now
        if (x>nn) finish=TRUE
        else {          
          status[x] <- status[x] +1
          finish=FALSE}
        list(status=status,p=p.onedraw[x],finish=finish)
      })
  res <- as.data.frame(do.call(rbind, lapply(la,function(x) x$status)))
  res$p <- sapply(la,function(x) x$p*p.now)
  res$finish <- sapply(la,function(x) x$finish)
  res
}
status.now <- rep(0,nn)
names(status.now) <- paste('person',1:5,sep='')
od <- one.draw(status.now,1,p.onedraw)
od
  person1 person2 person3 person4 person5          p finish
1       1       0       0       0       0 0.20000000  FALSE
2       0       1       0       0       0 0.15000000  FALSE
3       0       0       1       0       0 0.11666667  FALSE
4       0       0       0       1       0 0.09166667  FALSE
5       0       0       0       0       1 0.07500000  FALSE
6       0       0       0       0       0 0.36666667   TRUE
Where the sequence is not finished, the same chances apply again. For this a second function is build. Same outcomes are combined to restrict the number of outcomes.
one.draw.wrap <- function(x,p.pnedraw) {
  if (x$finish) return(x)
  one.draw(x[grep('person',names(x))],x$p,p.onedraw)
  }
  
new.draw <- function(od,p.onedraw) {
  todo <- od[!od$finish,]
  done <- od[od$finish,]
  la <- lapply(1:nrow(todo),function(x) one.draw.wrap(todo[x,],p.onedraw))
  snd <- do.call(rbind,la)
  snd <- snd[do.call(order,snd),]
  i <- 1
  while(i<nrow(snd) ) {
    if(all(snd[i,!(colnames(snd)=='p')] == snd[i+1,!(colnames(snd)=='p')])) {
      snd$p[i] <- snd$p[i] + snd$p[i+1]
      snd <- snd[-(i+1),]
    }
    else i=i+1
  }
  snd <- rbind(snd, done)
  snd[order(-snd$p),]
head(new.draw(od))

   person1 person2 person3 person4 person5          p finish
61       0       0       0       0       0 0.36666667   TRUE
6        1       0       0       0       0 0.07333333   TRUE
2        1       1       0       0       0 0.06000000  FALSE
25       0       1       0       0       0 0.05500000   TRUE
3        1       0       1       0       0 0.04666667  FALSE
35       0       0       1       0       0 0.04277778   TRUE
There are still quite some unfinished drawings, so we can redo that a few times. This is where a fast processor is practical.
myfin <- list(od=od)
for (i in 1:15) myfin[[i+1]] <- new.draw(myfin[[i]])
Even after 15 steps there are some unfinished sequences. Nevertheless I stop here.
xtabs(p ~finish,data=myfin[[15]])
finish
      FALSE        TRUE 
0.001057999 0.998942001 
In the finished series there are in on average 0.54 redraws at person 1. That is pretty close to 0.54 from the simulation.
with(myfin[[15]],sum(person1[finish]*p[finish]))
[1] 0.5398659
However, the 0.1% unfinished sequences would contribute a lot. Even if they were finished at step 15.
with(myfin[[15]],sum(person1*p))
0.5448775


Saturday, November 10, 2012

Secret Santa

On reddit somebody asked For n individuals, what's the probability that the last person to pick during a round of Secret Santa name picking, will pick their own name.. "With each person picking in turn, and re-picking if they pick out their own name.Two simulations were proposed. After some thinking I think it can be calculated rather easily.

Problem consideration

I think, it is possible to estimate the chance five people together draw any sequence. It is also possible to enumerate all sequences. Categorize the sequences, add their chances. Problem solved.

So, for example the sequence 2 5 1 3 4.
The first person has has 1/4 chance to select #2, as 2, 3, 4 and 5 can be selected with equal chances.
The second person has 1/4 chance to select #5, as 1, 3, 4 and 5 can be selected with equal chances.
The third person has 1/2 chance to select #1, as 1 or 4 can be selected.
The fourth person selects 3, as there is no alternative (4 results in a repick).
The fifth person selects 4, being the remaining number.
Chance to get this sequence: 1/32.
It is possible to recurse through all these trees, but why? After all we can easily enumerate all these sequences.
nn <-5
pp <- randtoolbox::permut(nn)
for(i in 1:(nn-1)) pp<- pp[!(pp[,i]==i),]
head(pp)
     i i i    
[1,] 5 4 1 3 2
[2,] 5 4 2 3 1
[3,] 5 4 1 2 3
[4,] 5 4 2 1 3
[5,] 5 3 4 1 2
[6,] 5 3 4 2 1
Then, the chances. For each column it is just 1/#possible_selections. That is one(!) statement.
la <- lapply(1:(nn-1),function(x) 1/rowSums(pp[,x:nn]!=x))
str(la)

List of 4
 $ : num [1:53] 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 ...
 $ : num [1:53] 0.333 0.333 0.333 0.333 0.333 ...
 $ : num [1:53] 0.5 0.5 0.5 0.5 0.333 ...
 $ : num [1:53] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 ...
Having these chances, they must be multiplied
prob <- Reduce('*',la)
sum(prob)
[1] 1
Luckily they add to 1. So, what is the chance the last person draws a 5? 0.13, exactly the same as a simulation number found on reddit. 
sum(prob[pp[,nn]==nn])
[1] 0.1319444

To get a bit more results, we can put it in a function and get the answers up to 9 persons:

secsan1 <- function(nn) {
  pp <- randtoolbox::permut(nn)
  for(i in 1:(nn-1)) pp<- pp[!(pp[,i]==i),]
  la <- lapply(1:(nn-1),function(x) 1/rowSums(pp[,x:nn]!=x))
  prob <- Reduce('*',la)
  sum(prob[pp[,nn]==nn])
}
data.frame(n=3:9,sapply(3:9,secsan1))
  n sapply.3.9..secsan1.
1 3           0.25000000
2 4           0.13888889
3 5           0.13194444
4 6           0.11277778
5 7           0.10053241
6 8           0.09049461
7 9           0.08238237
These numbers are close to one of the simulations.

The second simulation shows the chances the last person has to draw a person.
secsan2 <- function(nn) {
  pp <- randtoolbox::permut(nn)
  for(i in 1:(nn-1)) pp<- pp[!(pp[,i]==i),]
  la <- lapply(1:(nn-1),function(x) 1/rowSums(pp[,x:nn]!=x))
  prob <- Reduce('*',la)
  xtabs(prob ~ pp[,nn])
}
as.data.frame(secsan2(8))
  pp...nn.       Freq
1        1 0.10485639
2        2 0.10729167
3        3 0.11072279
4        4 0.11591789
5        5 0.12471088
6        6 0.14283588
7        7 0.20316988
8        8 0.09049461
These numbers are not equal to simulation numbers on Reddit. This is because this simulation contains an error. It reads:
selsan <- function(who,persons) {
  if (length(persons)==1) return(persons)
  sel <- sample(persons[persons!=who],1)
  return(c(sel,selsan(who+1,persons[persons!=sel])))
}
#selsan(1,1:5)
finselsan <- function(n){
  selsan(1,1:n)[n]
}
nrep=1e4
sa <- sapply(1:nrep,function(x) finselsan(8))
table(sa)/nrep
What happens is that sample() is pulling a trick on us. Most of the time sample() gets some values and selects one of those. It is possible however, that the second last person has also himself to draw. At which point sample thinks only one number is available. To quote the manual: If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x. Note that this convenience feature may lead to undesired behaviour when x is of varying length in calls such as sample(x). See the examples. It is even documented. But who reads these all the time? A corrected simulation avoids this:
selsan <- function(who,persons) {

  if (length(persons)==1) return(persons)
  if (length(persons[persons!=who])==1) return(c(persons[persons!=who],who))
  sel <- sample(persons[persons!=who],1)
  return(c(sel,selsan(who+1,persons[persons!=sel])))
}
finselsan <- function(n){
  selsan(1,1:n)[n]
}
nrep=1e4
sa <- sapply(1:nrep,function(x) finselsan(8))
as.data.frame(table(sa)/nrep)
  sa   Freq
1  1 0.1033
2  2 0.1024
3  3 0.1108
4  4 0.1123
5  5 0.1270
6  6 0.1498
7  7 0.2024
8  8 0.0920
This shows the exact calculation gives similar results as two simulations.

Extensions

What are the chances for any person to get anybody? It is simple now.  
nn <- 6
pp <- randtoolbox::permut(nn)
for(i in 1:(nn-1)) pp<- pp[!(pp[,i]==i),]
la <- lapply(1:(nn-1),function(x) 1/rowSums(pp[,x:nn]!=x))
prob <- Reduce('*',la)
sapply(1:nn,function(x) xtabs(prob ~ factor(pp[,x],levels=1:nn)))
  [,1] [,2]      [,3]      [,4]      [,5]      [,6]
1  0.0 0.24 0.2250000 0.2083333 0.1875000 0.1391667
2  0.2 0.00 0.2375000 0.2194444 0.1972222 0.1458333
3  0.2 0.19 0.0000000 0.2388889 0.2138889 0.1572222
4  0.2 0.19 0.1791667 0.0000000 0.2500000 0.1808333
5  0.2 0.19 0.1791667 0.1666667 0.0000000 0.2641667
6  0.2 0.19 0.1791667 0.1666667 0.1513889 0.1127778
The persons are columns, the rows are presents. Obviously 0 on the diagonal, except the last person. Closer to the last person, the chances get more unequal.

Conclusion

Don't do this. Get everybody to draw, then determine if a redraw is needed.

Sunday, November 4, 2012

Finishing football postings

For now this is the last post about these football data. It started in August, by now it is November. But just to finish up; the model as it should have been last week.

Model

As most of what I did is described last week, only the model as it went in Jags is shown.
Explanation: Each team has a total strength (TStr) and an attack/defense strength (AD). These combine to two numbers, Defense strength (DStr) and Attack strength (AStr). The attack strength against defense strength together determine the number of goals. Logically each team has one TStr, AD, DStr and Astr. In the model this is not so, Astr and Dstr are calculated for each game. 
The intention was to build in strategy. If a team plays a much stronger/weaker team, it might be possible to shift in attack/defense strength. Unfortunately, that did not work. Either because it does not happen or because my formulation did not work out so good.
A second possibility possible in this framework is to incorporate statements like; team X never wins when they play at time Y. That could translate to a change in TStr depending on day and time. Maybe later.
fbmodel1 <- function() {
  for (i in 1:N) {
    HomeMadeGoals[i] ~ dpois(HS[i])
    OutMadeGoals[i]  ~ dpois(OS[i])
    log(HS[i]) <- Home  + DstrO[i] + AstrH[i]
    log(OS[i]) <-         DstrH[i] + AstrO[i] 
    AstrH[i] <- TStr[HomeClub[i]]+AD[HomeClub[i]]
    DstrH[i] <- TStr[HomeClub[i]]-AD[HomeClub[i]]
    AstrO[i] <- TStr[OutClub[i]] +AD[OutClub[i]]
    DstrO[i] <- TStr[OutClub[i]] -AD[OutClub[i]]
  }
  TStr[1] <- 0
  AD[1] <- 0
  for (i in 2:nclub) {
    TStr[i] ~ dnormmix(MT,tauT1,EtaT)
    AD[i]   ~ dnormmix(MAD ,tauAD1, EtaAD)
  }
  for (i in 1:3) {
    MT[i]   ~ dnorm(0,.01)
    MAD [i] ~ dnorm(0,.01)
    tauT1[i]  <- tauT
    tauAD1[i] <- tauAD
    eee[i] <- 3
  }
  EtaT[1:3]  ~ ddirch(eee[1:3])
  EtaAD[1:3] ~ ddirch(eee[1:3])
  sigmaT <- 1/sqrt(tauT)
  tauT   ~ dgamma(.001,.001)
  sigmaAD  <- 1/sqrt(tauAD)
  tauAD ~ dgamma(.001,.001)
  Home ~ dnorm(0,.0001)
}

Results

As estimates for ADO Den Haag are now fixed, I am not sure any of the estimates can be interpreted as such. So, this is just to show the model runs through JAGS. 
Inference for Bugs model at "C:/Users/.../Rtmp4uyOHL/model1be0596d2d4b.txt", fit using jags,
 3 chains, each with 10000 iterations (first 5000 discarded), n.thin = 5
 n.sims = 3000 iterations saved
          mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat n.eff
AD[1]       0.000   0.000    0.000    0.000    0.000    0.000    0.000 1.000     1
AD[2]       0.676   0.132    0.421    0.585    0.676    0.763    0.935 1.004  1100
AD[3]       0.534   0.133    0.271    0.447    0.534    0.624    0.795 1.001  3000
AD[4]      -0.005   0.134   -0.261   -0.098   -0.006    0.089    0.261 1.002  1300
AD[5]      -0.084   0.140   -0.362   -0.179   -0.083    0.012    0.192 1.004   770
AD[6]       0.122   0.133   -0.143    0.035    0.121    0.210    0.388 1.002  1500
AD[7]       0.547   0.130    0.288    0.463    0.545    0.636    0.795 1.002  2600
AD[8]       0.250   0.133   -0.010    0.159    0.252    0.340    0.517 1.002  1100
AD[9]       0.555   0.131    0.298    0.469    0.555    0.641    0.802 1.001  2300
AD[10]      0.198   0.133   -0.073    0.111    0.197    0.286    0.466 1.002  1300
AD[11]      0.199   0.134   -0.061    0.110    0.195    0.289    0.468 1.003  1700
AD[12]      0.244   0.135   -0.019    0.154    0.243    0.334    0.516 1.003  2800
AD[13]      0.563   0.129    0.309    0.479    0.561    0.650    0.814 1.002  3000
AD[14]      0.196   0.135   -0.069    0.105    0.195    0.286    0.461 1.003   940
AD[15]      0.168   0.130   -0.081    0.080    0.166    0.255    0.413 1.003  1300
AD[16]      0.428   0.132    0.165    0.340    0.427    0.519    0.683 1.002  1600
AD[17]      0.321   0.140    0.054    0.225    0.317    0.413    0.604 1.002  3000
AD[18]      0.026   0.131   -0.227   -0.065    0.027    0.114    0.277 1.001  3000
Home        0.358   0.062    0.237    0.315    0.360    0.401    0.477 1.001  3000
TStr[1]     0.000   0.000    0.000    0.000    0.000    0.000    0.000 1.000     1
TStr[2]     0.186   0.082    0.012    0.134    0.189    0.241    0.346 1.008   320
TStr[3]     0.036   0.084   -0.126   -0.017    0.032    0.091    0.204 1.001  3000
TStr[4]     0.095   0.086   -0.067    0.032    0.094    0.158    0.262 1.003   840
TStr[5]     0.045   0.088   -0.126   -0.012    0.044    0.102    0.219 1.001  3000
TStr[6]     0.066   0.082   -0.092    0.009    0.066    0.123    0.224 1.001  2300
TStr[7]     0.209   0.080    0.044    0.158    0.211    0.262    0.361 1.001  2000
TStr[8]     0.147   0.084   -0.020    0.089    0.151    0.206    0.303 1.001  3000
TStr[9]     0.080   0.083   -0.077    0.021    0.077    0.137    0.241 1.002  1900
TStr[10]    0.153   0.084   -0.012    0.095    0.155    0.212    0.310 1.001  2400
TStr[11]    0.054   0.082   -0.106   -0.001    0.052    0.110    0.214 1.001  3000
TStr[12]   -0.010   0.083   -0.188   -0.061   -0.007    0.045    0.141 1.001  3000
TStr[13]    0.235   0.078    0.086    0.183    0.235    0.288    0.397 1.002  3000
TStr[14]   -0.002   0.084   -0.179   -0.055    0.000    0.054    0.152 1.003   960
TStr[15]    0.215   0.078    0.054    0.165    0.215    0.266    0.370 1.002  2000
TStr[16]    0.268   0.080    0.118    0.213    0.265    0.320    0.437 1.004   910
TStr[17]    0.008   0.082   -0.158   -0.044    0.007    0.061    0.165 1.001  3000
TStr[18]    0.164   0.083   -0.001    0.108    0.166    0.220    0.320 1.002  1400
sigmaAD     0.172   0.077    0.044    0.112    0.168    0.225    0.329 1.001  3000
sigmaT      0.081   0.041    0.025    0.050    0.075    0.105    0.176 1.015   140
deviance 1886.560   8.367 1872.324 1880.527 1886.032 1891.708 1904.662 1.003   680

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 34.9 and DIC = 1921.5
DIC is an estimate of expected predictive error (lower deviance is better).

Sunday, October 28, 2012

Mixed distribution

In the football data there was some reason to use a mixed distribution in the football data (ref) so I tried doing that. It was more difficult than I expected. Not only is a mixture of distributions fairly difficult, also the system was over parameterized, causing grief.

mixed distribution

data from a mixed distribution is simply said data which is from a combination of distributions. Most obvious example is length of people; males are longer than females. Combine their lengths and a joint non normal distribution appears. Separate them, two normal distributions. This problem is equivalent to the eyes problem in winbugs. Except that I use JAGS and it was supposed to be part of a larger model. The example section in JAGS shows two ways how to do this. In the end I chose to use the dnormmix distribution in the mix module. To quote the JAGS manual: 'The mix module defines a novel distribution dnormmix(mu,tau,pi) representing a finite mixture of normal distributions' and 'If you want to use the dnormmix distribution but do not care about label switching, then you can disable the tempered transition sampler with
set factory "mix::TemperedMix" off, type(sampler)'
That is me, so here is how to do that in R:
library(R2jags)
Loading required package: coda
Loading required package: lattice
Loading required package: R2WinBUGS
Loading required package: rjags
linking to JAGS 3.3.0
module basemod loaded
module bugs loaded
Loading required package: abind
Loading required package: parallel

Attaching package: 'R2jags'

The following object(s) are masked from 'package:coda':

traceplot

load.module("mix")
module mix loaded
set.factory("mix::TemperedMix", 'sampler', FALSE)
NULL
list.factories('sampler')
factory status
1 mix::TemperedMix FALSE
2 bugs::DSum TRUE
3 bugs::Conjugate TRUE
4 bugs::Dirichlet TRUE
5 bugs::MNormal TRUE
6 base::Finite TRUE
7 base::Slice TRUE

football model

This is the previous model with dnormmix grafted in and overall strength with attack/defense removed in an attempt to simplify and understand the estimation problems . Note that Astr[1] is fixed to 1, Dstr[1] is almost fixed to 0 and all other AStr and DStr are free. I am not 100% sure about those parts. Astr needs to be fixed because otherwise the whole set of Astr[] and Dstr[] changes in a similar way over the MCMC samples. It is their relative numbers that count. Somehow, there is still a need to fix Dstr[1], as there is still an overall change. Yet, I have the feeling fixing Astr[1] gives information to determine Dstr[2:nclub] hence Astr[2:nclub] hence Dstr[1]. Yet, they all move similar, and Dstr[1] is clearly not 0 as suggested by the prior.
fbmodel1 <- function() {
  for (i in 1:N) {
    HomeMadeGoals[i] ~ dpois(HS[i])
    OutMadeGoals[i] ~ dpois(OS[i])
    log(HS[i]) <- Home  + Dstr[OutClub[i]] + Astr[HomeClub[i]]
    log(OS[i]) <-        Dstr[HomeClub[i]] + Astr[OutClub[i]] 
  }
  Astr[1] <- 1
  Dstr[1] ~ dnorm(0,10)
  for (i in 2:nclub) {
    Astr[i] ~  dnormmix(MAStr,tauAStr1,EtaAStr1)
    Dstr[i] ~  dnormmix(MDStr,tauDStr1,EtaDStr1)
  }
  for (i in 1:3) {
    MAStr[i] ~ dnorm(0,.01)
    MDStr[i] ~ dnorm(0,.01)
    tauDStr1[i] <- tauDStr
    tauAStr1[i] <- tauAStr
    eee[i] <- 3
  }
  EtaAStr1[1:3] ~ ddirch(eee[1:3])
  EtaDStr1[1:3]  ~ ddirch(eee[1:3])
  sigmaAstr <- 1/sqrt(tauAStr)
  tauAStr ~ dgamma(.001,.001)
  sigmaDstr <- 1/sqrt(tauDStr)
  tauDStr ~ dgamma(.001,.001)
  Home ~ dnorm(0,.0001)
}
params <- c("Dstr","Astr","sigmaAstr","sigmaDstr","Home")
inits <- function(){
    list(TopStr=rnorm(JAGSData$nclub),
         AD=rnorm(JAGSData$nclub),
         sigmaAD=runif(1),
         sigmaStr=runif(0,.5),
         Home=rnorm(1)
         )
}
jagsfit <- jags(JAGSData, model=fbmodel1, inits=inits, 
                parameters=params,progress.bar="gui",
                n.iter=5000)
jagsfit

Inference for Bugs model at "C:/Users/.../RtmpSGssTD/modeld94a9b3849.txt", fit using jags,
 3 chains, each with 15000 iterations (first 7500 discarded), n.thin = 7
 n.sims = 3216 iterations saved
           mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat n.eff
Astr[1]      1.000   0.000    1.000    1.000    1.000    1.000    1.000 1.000     1
Astr[2]      1.641   0.160    1.338    1.533    1.639    1.748    1.971 1.033    69
Astr[3]      1.328   0.196    0.943    1.196    1.324    1.463    1.710 1.023    91
Astr[4]      0.878   0.185    0.513    0.753    0.880    1.007    1.228 1.021   120
Astr[5]      0.753   0.207    0.326    0.617    0.755    0.894    1.143 1.017   140
Astr[6]      0.949   0.177    0.598    0.836    0.950    1.065    1.304 1.021   110
Astr[7]      1.559   0.158    1.257    1.450    1.559    1.669    1.875 1.032    73
Astr[8]      1.167   0.193    0.804    1.034    1.161    1.293    1.566 1.022   120
Astr[9]      1.426   0.180    1.078    1.304    1.429    1.551    1.756 1.029    84
Astr[10]     1.117   0.187    0.764    0.990    1.114    1.237    1.499 1.024    99
Astr[11]     1.006   0.179    0.660    0.889    1.003    1.121    1.368 1.024    98
Astr[12]     0.955   0.181    0.603    0.830    0.957    1.073    1.312 1.024    91
Astr[13]     1.599   0.159    1.296    1.490    1.599    1.712    1.919 1.034    71
Astr[14]     0.927   0.179    0.566    0.809    0.928    1.053    1.277 1.025    90
Astr[15]     1.178   0.195    0.810    1.046    1.174    1.308    1.574 1.016   160
Astr[16]     1.539   0.164    1.216    1.429    1.538    1.656    1.854 1.034    70
Astr[17]     1.040   0.183    0.693    0.917    1.039    1.154    1.414 1.019   130
Astr[18]     0.968   0.180    0.614    0.849    0.971    1.090    1.315 1.029    91
Dstr[1]     -0.640   0.162   -0.977   -0.747   -0.638   -0.530   -0.331 1.031    79
Dstr[2]     -1.183   0.179   -1.533   -1.301   -1.183   -1.062   -0.844 1.034    66
Dstr[3]     -1.203   0.181   -1.568   -1.323   -1.196   -1.083   -0.853 1.023    97
Dstr[4]     -0.709   0.160   -1.011   -0.814   -0.710   -0.604   -0.400 1.040    58
Dstr[5]     -0.697   0.162   -1.020   -0.804   -0.697   -0.591   -0.375 1.050    46
Dstr[6]     -0.840   0.177   -1.208   -0.956   -0.833   -0.719   -0.508 1.031    70
Dstr[7]     -1.054   0.189   -1.413   -1.191   -1.055   -0.927   -0.675 1.039    62
Dstr[8]     -0.869   0.181   -1.237   -0.987   -0.862   -0.743   -0.538 1.023    91
Dstr[9]     -1.177   0.177   -1.533   -1.293   -1.178   -1.059   -0.832 1.022   100
Dstr[10]    -0.821   0.169   -1.174   -0.929   -0.812   -0.704   -0.503 1.022    98
Dstr[11]    -0.944   0.188   -1.307   -1.074   -0.935   -0.812   -0.593 1.023    95
Dstr[12]    -1.092   0.177   -1.433   -1.216   -1.092   -0.972   -0.739 1.024    87
Dstr[13]    -1.030   0.188   -1.386   -1.162   -1.036   -0.900   -0.650 1.023   100
Dstr[14]    -1.031   0.189   -1.384   -1.160   -1.028   -0.899   -0.665 1.035    64
Dstr[15]    -0.735   0.164   -1.065   -0.843   -0.734   -0.624   -0.418 1.042    54
Dstr[16]    -0.836   0.178   -1.212   -0.952   -0.830   -0.714   -0.500 1.025    88
Dstr[17]    -1.116   0.178   -1.456   -1.237   -1.121   -0.998   -0.758 1.042    55
Dstr[18]    -0.681   0.163   -1.005   -0.791   -0.681   -0.571   -0.367 1.041    54
Home         0.334   0.062    0.210    0.293    0.335    0.376    0.453 1.012   170
sigmaAstr    0.185   0.103    0.042    0.102    0.169    0.254    0.410 1.014   220
sigmaDstr    0.135   0.079    0.028    0.067    0.123    0.190    0.301 1.033    68
deviance  1889.263   8.560 1873.823 1883.430 1888.635 1894.484 1907.580 1.001  2400

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 36.6 and DIC = 1925.9
DIC is an estimate of expected predictive error (lower deviance is better).