Saturday, May 18, 2013

Bubble sort tuning

I was reading Paul Hiemsta's blogpost on Much more efficient bubble sort in R using the Rcpp and inline packages, went back to his first post  Bubble sort implemented in pure R and thought, surely we can do it better in pure R. So I cleaned inner loops, removed a recursion made some variations and finally made a big improvement as I vectorized it. I don't need to try Rcpp to know that is and will be faster, but surely closed the gap a bit.

Original code

This is how it was originally in the blog in pure R. If there are changes it is layout.

larger = function(pair) {
  if(pair[1] > pair[2]) return(TRUE) else return(FALSE)
}
swap_if_larger = function(pair) {
  if(larger(pair)) {
    return(rev(pair)) 
  } else {
    return(pair)
  }
}
swap_pass = function(vec) { 
  for(i in seq(1, length(vec)-1)) {
    vec[i:(i+1)] = swap_if_larger(vec[i:(i+1)])
  }
  return(vec)
}
bubble_sort = function(vec) {
  new_vec = swap_pass(vec)
  if(isTRUE(all.equal(vec, new_vec))) { 
    return(new_vec) 
  } else {
    return(bubble_sort(new_vec))
  }
}

Improve inside of loops

This is essentially the same code, but cleaned the first two functions. 
larger1 = function(pair) pair[1] > pair[2]

swap_if_larger1 = function(pair) {
  if(larger1(pair)) rev(pair) 
  else pair
}

swap_pass1 = function(vec) { 
  for(i in seq(1, length(vec)-1)) {
    vec[i:(i+1)] = swap_if_larger1(vec[i:(i+1)])
  }
  return(vec)
}

bubble_sort1 = function(vec) {
  new_vec = swap_pass1(vec)
  if(isTRUE(all.equal(vec, new_vec))) { 
    return(new_vec) 
  } else {
    return(bubble_sort1(new_vec))
  }
}

Improve outside loops

I cleaned the outside loop, then decided that the two inside functions were not needed any more, because it was reduced to one statement.
swap_pass2 = function(vec) { 
  for(i in 1:(length(vec)-1)) {
    if (vec[i] > vec[i+1] ) vec[c(i,i+1)] <- vec[c(i+1,i)]
  }
  return(vec)
}

bubble_sort2 = function(vec) {
  new_vec = swap_pass2(vec)
  if(identical(vec, new_vec)) new_vec 
  else bubble_sort2(new_vec)
}

No recursion

This was actually not intended a speed update, but R was complaining about too deep recursion. The same swapp_pass2 as before is used. 
bubble_sort3 = function(vec) {
  new_vec = swap_pass2(vec)
  while (!identical(vec, new_vec)) {
    vec <- new_vec
    new_vec = swap_pass2(vec)
  } 
  new_vec 
}

A different setup

The way I understood bubblesort is different from what is programmed so far. So far the sort is completed when no improvements are made, which is checked via a vector comparison. I always understood you don't check, the first bubble goes to the end, at which point the last element is the maximum. The second bubble then stops at end-1 etc. The final bubble is only the first two elements, after which the algorithm is finished. 
swap_pass4 = function(vec,iend) { 
  for(i in 1:iend) {
    if (vec[i] > vec[i+1] ) vec[c(i,i+1)] <- vec[c(i+1,i)]
  }
  return(vec)
}
bubble_sort4 = function(vec) {
  for (iend in (length(vec)-1):1) vec <- swap_pass4(vec,iend)
  vec
}

Tuning Paul's original algorithm

I was a bit disappointed with the improvements from using the true bubble sort. Apparently there is some gain by checking if the process is completed. Which is logical since the bubbles do some sorting of the intermediate elements. So, what about bubbles that go up and go down combined with a check on no improvements?
swap_pass3b = function(vec) { 
  for(i in length(vec):2) {
    if (vec[i] < vec[i-1] ) vec[c(i,i-1)] <- vec[c(i-1,i)]
  }
  return(vec)
}

bubble_sort3b = function(vec) {
  new_vec = swap_pass2(vec)
  while (!identical(vec, new_vec)) {
    new_vec = swap_pass2(vec)
    vec <- swap_pass3b(new_vec)
  } 
  vec 
}

Vectorizing the bubbles

When you look at the bubblesort without checking, it seems it is only assumed that the first step brings the highest element at the last position. This can be achieved by just pulling the highest element and placing it at the end, without the intermediate swaps.
bubble_sort5 = function(vec) {
  wm <- which.max(vec)
  vec <- c(vec[-wm],vec[wm])
  for (iend in ((length(vec)-1):2)) {
    wm <- which.max(vec[1:iend])
    vec <- c(vec[1:iend][-wm],vec[1:iend][wm],vec[(iend+1):length(vec)])
  }
  vec
}

Results

First a demonstration that they work;
test_vec = round(runif(10, 0, 100))
cbind(
    bubble_sort(test_vec),
    bubble_sort1(test_vec),
    bubble_sort2(test_vec),
    bubble_sort3(test_vec),
    bubble_sort4(test_vec),
    bubble_sort3b(test_vec),
    bubble_sort5(test_vec)
)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
 [1,]   28   28   28   28   28   28   28
 [2,]   41   41   41   41   41   41   41
 [3,]   42   42   42   42   42   42   42
 [4,]   46   46   46   46   46   46   46
 [5,]   52   52   52   52   52   52   52
 [6,]   68   68   68   68   68   68   68
 [7,]   89   89   89   89   89   89   89
 [8,]   93   93   93   93   93   93   93
 [9,]   98   98   98   98   98   98   98
[10,]  100  100  100  100  100  100  100

Speed

Now the timing. Cleaning the outer loops saves a third of the time. The bubble as I remembered it and the up-and-down bubble improve a bit. We are now at 13% of the original time. The final version, which uses the vectorization, blows them all out of the water. 
test_vec = runif(1000, 0, 100)
system.time(bubble_sort(test_vec))
#   user  system elapsed 
#  25.48    0.02   25.69 
system.time(bubble_sort1(test_vec))
#   user  system elapsed 
#  17.88    0.00   17.91 
system.time(bubble_sort2(test_vec))
#   user  system elapsed 
#   4.75    0.00    4.75 
system.time(bubble_sort3(test_vec))
#   user  system elapsed 
#   4.75    0.00    4.74 
system.time(bubble_sort4(test_vec))
#   user  system elapsed 
#   3.45    0.00    3.44 
system.time(bubble_sort3b(test_vec))
#   user  system elapsed 
#   3.57    0.00    3.59 
system.time(bubble_sort5(test_vec))
#   user  system elapsed 
#   0.06    0.00    0.06 

How about sort()?

Sort is much faster and scales much better, even more so after subtracting the time for the sample() call.. But I don't need to do the exercise to know that a not optimal algorithm in an interpreted language cannot compete with modern algorithms in a compiled language.
library(microbenchmark)
vector_size = 100                                                                           
print(microbenchmark(bubble_sort5(sample(vector_size)),                                                                          
        sort(sample(vector_size)),
        sample(vector_size))  )
Unit: microseconds
                              expr      min       lq    median       uq
 bubble_sort5(sample(vector_size)) 1730.041 1776.957 1806.2795 1866.758
         sort(sample(vector_size))  104.096  114.359  126.4540  180.334
               sample(vector_size)    5.865    8.797    9.8965   11.729
      max neval
 9978.521   100
  233.849   100
   21.992   100

Saturday, May 11, 2013

Reshaping data

Preparing and reshaping data is the ever continuing task of a data analyst. Luckily we have many tools for it. The default tool in R would be reshape(), although this is so user friendly that a reshape package has been added too. I try to use reshape() (the function) because I feel it is a good tool, though with a somewhat cryptical manual. The latter may be because it is written in terms of longitudal data, whereas my experience is  conversion of data from easy to enter in Excel to suitable for analysis in R.
To exercise myself a bit more I have taken all examples from the SAS transpose procedure and implemented them in R.

Examples 1 to 3

These examples are so simple, the best tool is the t() function.

score <- read.table(textConnection('
Student StudentID  Section  Test1 Test2 Final
Capalleti 0545 1  94 91 87
Dubose    1252 2  51 65 91
Engles    1167 1  95 97 97
Grant     1230 2  63 75 80
Krupski   2527 2  80 76 71
Lundsford 4860 1  92 40 86
McBane    0674 1  75 78 72'),header=TRUE,
  colClasses=rep(c('character','numeric'),each=3))
# id, only numerical data
# this case - t(score[,-1:-3])
# general
t(score[,sapply(score,class)=='numeric'])

      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
Test1   94   51   95   63   80   92   75
Test2   91   65   97   75   76   40   78
Final   87   91   97   80   71   86   72

#example 2: studentid
score2 <- score
rownames(score2) <- paste('sn',score2$StudentID,sep='')
t(score2[,-1:-3])

      sn0545 sn1252 sn1167 sn1230 sn2527 sn4860 sn0674
Test1     94     51     95     63     80     92     75
Test2     91     65     97     75     76     40     78
Final     87     91     97     80     71     86     72
#example 3 student names

rownames(score2) <- score2$Student
t(score2[,-1:-3])

      Capalleti Dubose Engles Grant Krupski Lundsford McBane
Test1        94     51     95    63      80        92     75
Test2        91     65     97    75      76        40     78
Final        87     91     97    80      71        86     72

Example 4 by groups

The first 'real' example. Some columns are used to transpose on. In addition the data is ragged, not everywhere there is data. Two unexpected problems appeared here. Location contains a space, which I solved by separating this part and adding it later. Date is in a 7 digit format, but not my locale and a non default layout. Since I need to sort in order to get the data ordered like SAS does, I converted it to a proper date variable.
The transpose itself is not so difficult. I chose to extract the variables to transpose via grep(), so I could reuse this part in the times variable. To make tracking between call and result easy, I used lower and upper case in 'length'. Length is transposed, so weight is dropped from the data. 
# example 4 by groups

fishdata1 <- readLines(textConnection(
'Cole Pond   2JUN95 31 .25 32 .3  32 .25 33 .3
Cole Pond   3JUL95 33 .32 34 .41 37 .48 32 .28
Cole Pond   4AUG95 29 .23 30 .25 34 .47 32 .3
Eagle Lake  2JUN95 32 .35 32 .25 33 .30
Eagle Lake  3JUL95 30 .20 36 .45
Eagle Lake  4AUG95 33 .30 33 .28 34 .42'))
fishdata <- read.table(text=c('Date Length1 Weight1 Length2 Weight2 Length3 Weight3 Length4 Weight4',
           substring(fishdata1,11)),
       fill=TRUE,
       header=TRUE)
fishdata$Location <- gsub(' $','',substring(fishdata1,1,10))
Sys.setlocale(category = "LC_TIME", locale = "C")
fishdata$Date <- as.Date(fishdata$Date,'%d%b%y')
lengthfishdata <- reshape(fishdata,
    varying=list(grep('Length',names(fishdata),value=TRUE)),
    direction='long',
    idvar=c('Date','Location'),
    drop=grep('Weight',names(fishdata),value=TRUE),
    v.names='LENGTH',
    timevar='NAME',
    times=tolower(grep('Length',names(fishdata),value=TRUE)),
  )
rownames(lengthfishdata)  <- 1:nrow(lengthfishdata)
lengthfishdata[order(lengthfishdata$Location,
        lengthfishdata$Date,
        lengthfishdata$NAME),]

         Date   Location    NAME LENGTH
1  1995-06-02  Cole Pond length1     31
7  1995-06-02  Cole Pond length2     32
13 1995-06-02  Cole Pond length3     32
19 1995-06-02  Cole Pond length4     33
2  1995-07-03  Cole Pond length1     33
8  1995-07-03  Cole Pond length2     34
14 1995-07-03  Cole Pond length3     37
20 1995-07-03  Cole Pond length4     32
3  1995-08-04  Cole Pond length1     29
9  1995-08-04  Cole Pond length2     30
15 1995-08-04  Cole Pond length3     34
21 1995-08-04  Cole Pond length4     32
4  1995-06-02 Eagle Lake length1     32
10 1995-06-02 Eagle Lake length2     32
16 1995-06-02 Eagle Lake length3     33
22 1995-06-02 Eagle Lake length4     NA
5  1995-07-03 Eagle Lake length1     30
11 1995-07-03 Eagle Lake length2     36
17 1995-07-03 Eagle Lake length3     NA
23 1995-07-03 Eagle Lake length4     NA
6  1995-08-04 Eagle Lake length1     33
12 1995-08-04 Eagle Lake length2     33
18 1995-08-04 Eagle Lake length3     34
24 1995-08-04 Eagle Lake length4     NA
In practice my call would be different, I would keep both height and weight, and stick the number in a more logical variable than NAME. I think SAS can only achieve this by two transpose calls and a subsequent merge, though I may be mistaken.
reshape(fishdata,
    varying=list(grep('Length',names(fishdata),value=TRUE),
        grep('Weight',names(fishdata),value=TRUE)),
    direction='long',
    idvar=c('Date','Location'),
    v.names=c('Length','Weight'),
    timevar='Fish number',
    new.row.names=1:24
)

         Date   Location Fish number Length Weight
1  1995-06-02  Cole Pond           1     31   0.25
2  1995-07-03  Cole Pond           1     33   0.32
3  1995-08-04  Cole Pond           1     29   0.23
4  1995-06-02 Eagle Lake           1     32   0.35
5  1995-07-03 Eagle Lake           1     30   0.20
6  1995-08-04 Eagle Lake           1     33   0.30
7  1995-06-02  Cole Pond           2     32   0.30
8  1995-07-03  Cole Pond           2     34   0.41
9  1995-08-04  Cole Pond           2     30   0.25
10 1995-06-02 Eagle Lake           2     32   0.25
11 1995-07-03 Eagle Lake           2     36   0.45
12 1995-08-04 Eagle Lake           2     33   0.28
13 1995-06-02  Cole Pond           3     32   0.25
14 1995-07-03  Cole Pond           3     37   0.48
15 1995-08-04  Cole Pond           3     34   0.47
16 1995-06-02 Eagle Lake           3     33   0.30
17 1995-07-03 Eagle Lake           3     NA     NA
18 1995-08-04 Eagle Lake           3     34   0.42
19 1995-06-02  Cole Pond           4     33   0.30
20 1995-07-03  Cole Pond           4     32   0.28
21 1995-08-04  Cole Pond           4     32   0.30
22 1995-06-02 Eagle Lake           4     NA     NA
23 1995-07-03 Eagle Lake           4     NA     NA
24 1995-08-04 Eagle Lake           4     NA     NA

Example 5

Example 5 is named: Naming Transposed Variables When the ID Variable Has Duplicate Values. I am not sure what the naming part is, it seems that only the closing call price is needed, which boils down to taking only the last observation in a category. The data here comes in a fixed format array, I have used (first time) read.fwf() and some calls to strip the spaces from the factors.
To get the transpose I borrowed the idea of a last function from SAS, which is basically a function which indicates that the current record in a variable is different from the next record. Which is very important in SAS because it is fundamentally row (record) organized, whereas R is column (variable) organized. Proper R would just processing dependent on Time='closing' but what is used is functionally closer to the SAS call.
stocks <- read.fwf(textConnection(
'Horizon Kites jun11 opening 29
Horizon Kites jun11 noon    27
Horizon Kites jun11 closing 27
Horizon Kites jun12 opening 27
Horizon Kites jun12 noon    28
Horizon Kites jun12 closing 30
SkyHi Kites   jun11 opening 43
SkyHi Kites   jun11 noon    43
SkyHi Kites   jun11 closing 44
SkyHi Kites   jun12 opening 44
SkyHi Kites   jun12 noon    45
SkyHi Kites   jun12 closing 45'
),col.names=c('Company','Date','Time','Price'),
widths=c(14,5,8,3))
levels(stocks$Company) <- gsub('(^ +)|( +$)','',levels(stocks$Company))
levels(stocks$Time) <- gsub('(^ +)|( +$)','',levels(stocks$Time))
# only last observation
islast <- function(x)  c(x[-length(x)]!=x[-1],TRUE)
reshape(stocks[islast(stocks$Date),],direction='wide',
    timevar='Date',idvar=c('Company'),drop='Time',times=Time)

        Company Price.jun11 Price.jun12
3 Horizon Kites          27          30
9   SkyHi Kites          44          45

Example 6 

Transposing for statistical analysis. We have 5 subjects, who did 3 programs and 7 strength assessments. Why subject is not in the original data baffles me. In SAS this is not run through proc transpose, but rather a data step. But that doesn't stop me from using reshape(). The subject variable is added. In the next step SAS rebuilds to get one file with both formats, seems silly in R context, I am just transforming back, now using subject. 

weights <- read.table(textConnection(
'Program s1 s2 s3 s4 s5 s6 s7
CONT  85 85 86 85 87 86 87
CONT  80 79 79 78 78 79 78
CONT  78 77 77 77 76 76 77
CONT  84 84 85 84 83 84 85
CONT  80 81 80 80 79 79 80
RI    79 79 79 80 80 78 80
RI    83 83 85 85 86 87 87
RI    81 83 82 82 83 83 82
RI    81 81 81 82 82 83 81
RI    80 81 82 82 82 84 86
WI    84 85 84 83 83 83 84
WI    74 75 75 76 75 76 76
WI    83 84 82 81 83 83 82
WI    86 87 87 87 87 87 86
WI    82 83 84 85 84 85 86
'),header=TRUE)
weights1 <- reshape(weights,direction='long',timevar='time',idvar='row',
    varying=list(paste('s',1:7,sep='')),v.names='Strength')
weights1$subject <- 1+ ((weights1$row-1) %% 5)
weights1 <- weights1[order(weights1$Program,weights1$subject,weights1$time),]
(Weights1 <- weights1[,c(1,5,2,3)])[1:15,]

    Program subject time Strength
1.1    CONT       1    1       85
1.2    CONT       1    2       85
1.3    CONT       1    3       86
1.4    CONT       1    4       85
1.5    CONT       1    5       87
1.6    CONT       1    6       86
1.7    CONT       1    7       87
2.1    CONT       2    1       80
2.2    CONT       2    2       79
2.3    CONT       2    3       79
2.4    CONT       2    4       78
2.5    CONT       2    5       78
2.6    CONT       2    6       79
2.7    CONT       2    7       78
3.1    CONT       3    1       78
reshape(Weights1,direction='wide',timevar='time',
    idvar=c('Program','subject'))

     Program subject Strength.1 Strength.2 Strength.3 Strength.4 Strength.5
1.1     CONT       1         85         85         86         85         87
2.1     CONT       2         80         79         79         78         78
3.1     CONT       3         78         77         77         77         76
4.1     CONT       4         84         84         85         84         83
5.1     CONT       5         80         81         80         80         79
6.1       RI       1         79         79         79         80         80
7.1       RI       2         83         83         85         85         86
8.1       RI       3         81         83         82         82         83
9.1       RI       4         81         81         81         82         82
10.1      RI       5         80         81         82         82         82
11.1      WI       1         84         85         84         83         83
12.1      WI       2         74         75         75         76         75
13.1      WI       3         83         84         82         81         83
14.1      WI       4         86         87         87         87         87
15.1      WI       5         82         83         84         85         84
     Strength.6 Strength.7
1.1          86         87
2.1          79         78
3.1          76         77
4.1          84         85
5.1          79         80
6.1          78         80
7.1          87         87
8.1          83         82
9.1          83         81
10.1         84         86
11.1         83         84
12.1         76         76
13.1         83         82
14.1         87         86
15.1         85         86


Sunday, May 5, 2013

Simulation shows gain of clmm over ANOVA is small

After last post's setting up for a simulation, it is now time to look how the models compare. To my disappointment with my simple simulations of assessors behavior the gain is minimal. Unfortunately, the simulation took much more time than I expected, so I will not expand it.

Background

I have been looking at the ordered logistic model in a number of postings. The reason is that in general I find people use ANOVA for analyzing data on a nine point scale, whereas you would think an ordered logistic model works better. Two posts (first and second) showed the current methods in R, and and JAGS are easy to use and with some tweaking provide suitable output to present. Subsequently I made the simulated data generator. Now it is time to make the final comparison.  

Simulations

The core of the simulator is explained elsewhere so I won't explain here again. I did however notice a small error, so the corrected code is given here. Some new parts are added, wrappers around the data generator and the analysis. And to my big disappointment I could not even build that as desired. The call anova(Res.clmm2,Res.clmm) with subsequent extraction has been replaced by the ugly pchisq(2*(Res.clmm$logLik-Res.clmm2$logLik),2,lower.tail=FALSE ). 2 represents the degrees of freedom for products in my simulations. Somehow that call to ANOVA did not run within a function, after trying too many variations I choose the short cut.

library(ordinal) 
library(ggplot2)

num2scale <- function(x,scale) {
  cut(x,c(-Inf,scale,Inf),1:(length(scale)+1))
}
pop.limits2ind.limits <- function(scale,sd=.5) {
  newscale <- scale+rnorm(length(scale),sd=sd)
  newscale[order(newscale)]
}

obs.score <- function(obs,pop.score,pop.limits,
    sensitivity.sd=.1, precision.sd=1,
    additive.sd=2,center.scale=5,
    labels=LETTERS[1:length(pop.score)]) {
  # individual sensitivity (multiplicative)
  obs.errorfreeintensity <- center.scale + 
      (pop.score-center.scale)*rlnorm(1,sd=sensitivity.sd)
  #individual (additive) 
  obs.errorfreeintensity <- obs.errorfreeintensity +
      rnorm(1,mean=0,sd=additive.sd)
  # individual observation error 
  obs.intensity <- obs.errorfreeintensity+
      rnorm(length(pop.score),sd=precision.sd)
  
  # individual cut offs between categories  
  obs.limits <- pop.limits2ind.limits(pop.limits)
  obs.score <- num2scale(obs.intensity,obs.limits)
  data.frame(obs=obs,
      score = obs.score,
      product=labels
  )
}

panel.score <- function(n=100,pop.score,pop.limits,
    sensitivity.sd=.1,precision.sd=1,
    additive.sd=2,center.scale=5,
    labels=LETTERS[1:length(pop.score)]) {
  la <- lapply(1:n,function(x) {
        obs.score(x,pop.score,pop.limits,sensitivity.sd=sensitivity.sd,
            precision.sd=precision.sd,additive.sd=additive.sd,labels=labels)
      })
  dc <- do.call(rbind,la)
  dc$obs <- factor(dc$obs)
  dc
}

overallP <- function(scores) {
  Res.aov <- aov( numresponse ~ obs +  product , data=scores)
  paov <- anova(Res.aov)['product','Pr(>F)']
  Res.clmm <- clmm(score  ~ product + (1|obs),data=scores)
  Res.clmm2 <- clmm(score ~ 1 + (1|obs),data=scores)
  c(ANOVA=paov,
    clmm = pchisq(2*(Res.clmm$logLik-Res.clmm2$logLik),2,lower.tail=FALSE ) #.1687
   )
}

onesim <- function(prodmeans,pop.limits,center.scale,additive.sd=2) {
  scores <- panel.score(40,prodmeans,pop.limits,
      center.scale=center.scale,additive.sd=additive.sd)
  scores$numresponse <- as.numeric(levels(scores$score))[scores$score]
  overallP(scores)
}

Simulation I, 5 categories

The first simulation is with 5 categories. This represents the Just About Right (JAR) scale. The final plot shows the difference between ANOVA and clmm is minimal.

pop.limits <- c(1,2.5,4.5,6)

nsim <- 250
sim5cat1 <- lapply(seq(0,.6,.05),function(dif) {
    prodmeans=rep(3.5,3)+c(-dif,0,dif)
    sa <- sapply(1:nsim,function(x) onesim(prodmeans,pop.limits))
    data.frame(dif=dif,nreject=as.numeric(rowSums(sa<0.05)),
        method=rownames(sa))
  }
)

sim5cat1tb <- do.call(rbind,sim5cat1)
ggplot(sim5cat1tb, aes(dif,nreject/nsim ,colour=method)) +
     geom_line() + xlab('Difference between products') + 
     ylab('Proportion significant (at 5% Test)') +
     theme(legend.position = "bottom") + ylim(c(0,1)) +
     guides(colour = guide_legend('Analysis Method'))

Simulation 2, 9 categories

This simulation represents the intensity and liking scales. Again the difference between ANOVA and clmm are minimal.

pop.limits <- c(1,3:8,10)
prodmeans <- c(7,7,7)
scores <- panel.score(40,prodmeans,pop.limits,additive.sd=1)
scores$numresponse <- as.numeric(levels(scores$score))[scores$score]
nsim <- 250
sim9cat <- lapply(seq(0.,.6,.05),function(dif) {
      prodmeans=c(7,7,7)+c(-dif,0,dif)
      sa <- sapply(1:nsim,function(x) onesim(prodmeans,pop.limits,
                additive.sd=1))
      data.frame(dif=dif,nsign=as.numeric(rowSums(sa>0.05)),
          method=rownames(sa))
    }
)
sim9cattb <- do.call(rbind,sim9cat)
ggplot(sim9cattb, aes(dif,(nsim-nsign)/nsim ,colour=method)) +
    geom_line() + xlab('Difference between products') + 
    ylab('Proportion significant (at 5% Test)') +
    theme(legend.position = "bottom") + ylim(c(0,1)) +
    guides(colour = guide_legend('Analysis Method'))

Conclusion

The differences between clmm and ANOVA seem to be minimal. I had not expected large differences, but especially at 5 categories my expectation were to find real differences as the continuous data is more violated. Obviously, a load more simulations would be needed to draw final conclusions. This is beyond the scope of my blog. 
To conclude, in theory clmm is much more suitable than ANOVA for ordinal data. There are no reasons in terms of presentation to prefer ANOVA over clmm. But in practice the difference may be minimal.

Sunday, April 21, 2013

Ordinal data, models with observers

I recently made three posts regarding analysis of ordinal data. A post looking at all methods I could find in R, a post with an additional method and a post using JAGS. Common in all three was using the cheese data, a data set where only product data was available, observers were not there. The real world is different, there are observers and they are quite different. So, in this post I have added observers and look at two models again; Simple ANOVA and, complex, an ordinal model with random observers.

Data

To spare myself looking for data, and keeping in mind I may want to run some simulations, I have created a data generator. In this generator I entered some of the things I could imagine regarding the observers. First of all they all have different sensitivities. In other words, what is very salt for some, is not at all salt for others. This is expressed in an additive and a multiplicative effect. Second, they have observation error. The same product does not always give the same response. Third is the categories, observers use them slightly different. I am absolutely aware these are much more properties than I might estimate. After all, a typical consumer takes two or three products in a test, while I have many more parameters. Finally, the outer limits (categories 1 and 9) are placed a bit outward. This is because there is an aversion to using these categories.
num2scale <- function(x,scale) {
  cut(x,c(-Inf,scale,Inf),1:(length(scale)+1))
}
pop.limits2ind.limits <- function(scale,sd=.5) {
  newscale <- scale+rnorm(length(scale),sd=sd)
  newscale[order(newscale)]
}

obs.score <- function(obs,pop.score,pop.limits,scale.sd=0.5,
    sensitivity.sd=.1, precision.sd=1,
    additive.sd=2,center.scale=5,
    labels=LETTERS[1:length(pop.score)]) {
  # individual sensitivity (multiplicative)
  obs.errorfreeintensity <- center.scale + 
      (pop.score-center.scale)*rlnorm(1,sd=sensitivity.sd)
  #individual (additive) 
  obs.errorfreeintensity <- obs.errorfreeintensity +
      rnorm(1,mean=0,sd=additive.sd)
  # individual observation error 
  obs.intensity <- obs.errorfreeintensity+
      rnorm(length(pop.score),sd=precision.sd)
  # individual cut offs between categories  
  obs.limits <- pop.limits2ind.limits(pop.limits)
  obs.score <- num2scale(obs.intensity,obs.limits)
  data.frame(obs=obs,
      score = obs.score,
      product=labels
  )
}

panel.score <- function(n=100,pop.score,pop.limits,scale.sd=0.5,
    sensitivity.sd=.1,precision.sd=1,
    additive.sd=2,center.scale=5,
    labels=LETTERS[1:length(pop.score)]) {
  la <- lapply(1:n,function(x) {
        obs.score(x,pop.score,pop.limits,scale.sd,sensitivity.sd,
            precision.sd,labels=labels)
      })
  dc <- do.call(rbind,la)
  dc$obs <- factor(dc$obs)
  dc
}

pop.limits <- c(1,3:8,10)
prodmeans <- c(7,7,8)
scores <- panel.score(40,prodmeans,pop.limits)
scores$numresponse <- as.numeric(levels(scores$score))[scores$score]

Analysis - ANOVA

The first analysis method is ANOVA. Plain ANOVA. Not because I dislike variance components or such, but because this is what is done most commonly. On top of that, I cannot imagine somebody taking these data, pulling it through a mixed model, while forgetting it is not from a continuous scale.
library(multcomp)
Res.aov <- aov( numresponse ~ obs + product , data=scores)
anova(Res.aov)

Analysis of Variance Table

Response: numresponse
          Df  Sum Sq Mean Sq F value    Pr(>F)    
obs       39 239.300  6.1359  6.3686 2.321e-12 ***
product    2   9.517  4.7583  4.9388   0.00956 ** 
Residuals 78  75.150  0.9635                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
mt <- model.tables(Res.aov,type='means',cterms='product') 

data.frame(dimnames(mt$tables$product)[1],
    Response=as.numeric(mt$tables$product),
    LSD=cld(glht(Res.aov,linfct=mcp(product='Tukey'))
    )$mcletters$monospacedLetters
)    
  product Response LSD
A       A    6.725  a 
B       B    6.850  a 
C       C    7.375   b

Analysis - cumulative link mixed model (clmm)

The alternative, is the other extreme. Do it as correct as available, using both a cumulative link and making observers random. This model is standard available in the ordinal package. This means two intermediate models are not shown. Both of these models lack the simplicity of ANOVA while being less correct than clmm, hence inferior to both clmm and ANOVA.
library(ordinal) 
Res.clmm <- clmm(score  ~ product + (1|obs),data=scores)
summary(Res.clmm)

Cumulative Link Mixed Model fitted with the Laplace approximation

formula: score ~ product + (1 | obs)
data:    scores

 link  threshold nobs logLik  AIC    niter    max.grad cond.H
 logit flexible  120  -179.57 379.14 37(5046) 8.77e-06 Inf   

Random effects:
      Var Std.Dev
obs 8.425   2.903
Number of groups:  obs 40 

Coefficients:
         Estimate Std. Error z value Pr(>|z|)   
productB   0.1309     0.4225   0.310  0.75680   
productC   1.4640     0.4718   3.103  0.00192 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Threshold coefficients:
    Estimate Std. Error z value
2|3  -6.0078         NA      NA
3|4  -5.6340         NA      NA
4|5  -4.1059     0.4528  -9.069
5|6  -3.0521     0.4769  -6.400
6|7  -1.0248     0.4970  -2.062
7|8   0.5499     0.5303   1.037
8|9   4.0583     0.7429   5.463
Res.clmm2 <- clmm(score ~ 1 + (1|obs),data=scores)

anova(Res.clmm2,Res.clmm)

Likelihood ratio tests of cumulative link models:

          formula:                    link: threshold:
Res.clmm2 score ~ 1 + (1 | obs)       logit flexible  
Res.clmm  score ~ product + (1 | obs) logit flexible  

          no.par    AIC  logLik LR.stat df Pr(>Chisq)   
Res.clmm2      8 387.41 -185.70                         
Res.clmm      10 379.14 -179.57  12.266  2    0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
co <- coef(Res.clmm)[c('7|8','productB','productC')]

vc <- vcov(Res.clmm)[c('7|8','productB','productC'),
    c('7|8','productB','productC')]
names(co) <- levels(scores$product)
sd <- sqrt(c(vc[1,1],diag(vc)[-1]+vc[1,1]-2*vc[1,-1]))
data.frame(
    `top 2 box`=arm::invlogit(c(-co[1],-co[1]+co[-1])),
    `2.5% limit`=arm::invlogit(c(-co[1],-co[1]+co[-1])+qnorm(.025)*sd),
    `97.5% limit`=arm::invlogit(c(-co[1],-co[1]+co[-1])+qnorm(.975)*sd),
    check.names=FALSE
)

  top 2 box 2.5% limit 97.5% limit
A 0.3658831  0.1694873   0.6199713
B 0.3967401  0.1753347   0.6704337
C 0.7138311  0.4498042   0.8838691