Data
The data are responses to 4 cheeses on 9 response categories, where 1 = strong dislike and 9 = excellent taste. To be honest, I had forgotten about these data and ran into them again at https://onlinecourses.science.psu.edu/stat504/node/177. For compactness of the R script the data are imported a bit differently, and I also reshaped them to one observation per row. A plot then shows what the data look like.
cheese <- data.frame(Cheese=rep(LETTERS[1:4],each=9),
    Response=rep(1:9,4),
    N=as.integer(c(0,0,1,7,8,8,19,8,1,
                   6,9,12,11,7,6,1,0,0,
                   1,1,6,8,23,7,5,1,0,
                   0,0,0,1,3,7,14,16,11)))
cheese$FResponse <- ordered(cheese$Response)
#one record(row) for each observation
cheese2 <- cheese[cheese$N>0,]
cheese2 <- lapply(1:nrow(cheese2), function(x)
    do.call(rbind,
        replicate(cheese2$N[x], cheese2[x, c(1:2,4)],
            simplify=FALSE)))
cheese2 <- do.call(rbind, cheese2)
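The same one-observation-per-row expansion can be written more compactly with rep(); a minimal sketch of an equivalent construction (not the original author's code), which gives the same data frame up to row names:
# Repeat each row of 'cheese' N times; rows with N == 0 drop out automatically
cheese2 <- cheese[rep(seq_len(nrow(cheese)), cheese$N), c('Cheese','Response','FResponse')]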
mosaicplot(
    xtabs(~ Cheese + factor(Response, levels=9:1), data=cheese2),
    color=colorRampPalette(c('green','red'))(9),
    main='',
    ylab='Response')
ANOVA
As explained, the first approach is ANOVA. From a statistical point of view it is not ideal, since it ignores the fact that these are nine ordered categories. However, it saves a bit when explaining what you did. I would not trust a sales person to understand ANOVA, but the other side of the table is used to it, so they won't ask. Anyway, below are a number of variations on the result.
library(multcomp)
Res.aov <- aov( Response ~ Cheese, data=cheese2)
anova(Res.aov)
Analysis of Variance Table
Response: Response
Df Sum Sq Mean Sq F value Pr(>F)
Cheese 3 450.48 150.160 76.296 < 2.2e-16 ***
Residuals 204 401.50 1.968
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(Res.aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Response ~ Cheese, data = cheese2)
$Cheese
diff lwr upr p adj
B-A -2.750000 -3.4626832 -2.0373168 0.0000000
C-A -1.384615 -2.0972986 -0.6719321 0.0000063
D-A 1.173077 0.4603937 1.8857602 0.0001793
C-B 1.365385 0.6527014 2.0780679 0.0000087
D-B 3.923077 3.2103937 4.6357602 0.0000000
D-C 2.557692 1.8450091 3.2703755 0.0000000
par(mar=c(5,4,6,2)+.1)
plot(cld(glht(Res.aov,linfct=mcp(Cheese='Tukey'))))
par(mar=c(5,4,4,2)+.1)
mt <- model.tables(Res.aov,type='means',cterms='Cheese')
data.frame(dimnames(mt$tables$Cheese)[1],
    Response=as.numeric(mt$tables$Cheese),
    LSD=cld(glht(Res.aov,linfct=mcp(Cheese='Tukey')))$mcletters$monospacedLetters)
Cheese Response LSD
A A 6.250000 c
B B 3.500000 a
C C 4.865385 b
D D 7.423077 d
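As a quick sanity check of the means in this table, the same cell means can also be computed directly from the raw data; a minimal sketch I'm adding, not part of the original post:
# Raw per-cheese means; these should match the model.tables() output above
aggregate(Response ~ Cheese, data=cheese2, FUN=mean)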
polr
The first statistically correct way to look at the data is polr from the MASS package. Unfortunately, I dislike the output a bit. For instance, the base level (cheese A) is dropped from some of the output. I am not sure how to interpret the difference between cheese A and cheese B of about -3, other than that it is significant and cheese A is liked better. A feature that is not obviously documented is the possibility to get the category proportions as predictions. From these I learn that the model puts 17% of the respondents in the top 2 box for cheese A. Top 2 box is a measure I won't have to explain to my product manager, I am sure. No confidence interval for it, though.
library(MASS)
Res.polr <- polr( FResponse ~ Cheese, data=cheese2 )
summary(Res.polr)
Call:
polr(formula = FResponse ~ Cheese, data = cheese2)
Coefficients:
Value Std. Error t value
CheeseB -3.352 0.4287 -7.819
CheeseC -1.710 0.3715 -4.603
CheeseD 1.613 0.3805 4.238
Intercepts:
Value Std. Error t value
1|2 -5.4674 0.5236 -10.4413
2|3 -4.4122 0.4278 -10.3148
3|4 -3.3126 0.3700 -8.9522
4|5 -2.2440 0.3267 -6.8680
5|6 -0.9078 0.2833 -3.2037
6|7 0.0443 0.2646 0.1673
7|8 1.5459 0.3017 5.1244
8|9 3.1058 0.4057 7.6547
Residual Deviance: 711.3479
AIC: 733.3479
confint(Res.polr)
2.5 % 97.5 %
CheeseB -4.2134700 -2.5304266
CheeseC -2.4508895 -0.9919797
CheeseD 0.8794083 2.3746894
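Since the proportional-odds model works on the log-odds scale, one way to make the -3.35 for cheese B a bit more tangible is to exponentiate it into an odds ratio relative to cheese A; a minimal sketch, not in the original post:
# Odds ratios (odds of a higher rating, relative to cheese A) and profile-likelihood CIs
exp(coef(Res.polr))    # cheese B: exp(-3.35) ~ 0.035, i.e. much lower odds than cheese A
exp(confint(Res.polr)) # the confidence intervals above, on the odds-ratio scale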
confint(glht(Res.polr,linfct=mcp(Cheese='Tukey')))
Multiple Comparisons of Means: Tukey Contrasts
Fit: polr(formula = FResponse ~ Cheese, data = cheese2)
Quantile = 2.5617
95% family-wise confidence level
Linear Hypotheses:
Estimate lwr upr
B - A == 0 -3.3518 -4.4500 -2.2537
C - A == 0 -1.7099 -2.6615 -0.7583
D - A == 0 1.6128 0.6379 2.5876
C - B == 0 1.6419 0.6806 2.6033
D - B == 0 4.9646 3.7434 6.1858
D - C == 0 3.3227 2.2421 4.4033
topred <- data.frame(Cheese=levels(cheese2$Cheese))
predict(Res.polr,topred,type='prob')
1 2 3 4 5 6
1 0.0042045373 0.007778488 0.02315719 0.06072687 0.1915897 0.22360710
2 0.1075966845 0.149644074 0.25256118 0.24192345 0.1684022 0.04745525
3 0.0228101657 0.040027267 0.10476248 0.20195874 0.3208724 0.16204645
4 0.0008409326 0.001570815 0.00479563 0.01349081 0.0537320 0.09799769
7 8 9
1 0.31326180 0.132805278 0.042868991
2 0.02500937 0.005841786 0.001566035
3 0.11040482 0.029081216 0.008036485
4 0.31086686 0.333235095 0.183470165
topred$Top2Box.polr <- rowSums(predict(Res.polr,topred,type='prob')[,8:9])
topred
  Cheese Top2Box.polr
1 A 0.175674269
2 B 0.007407821
3 C 0.037117701
4 D 0.516705260
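The top-2-box figures can also be reproduced by hand from the fitted coefficients, which makes the link between the model and this table explicit. A minimal sketch under the polr parameterisation (P(Y >= 8) = 1 - logistic(zeta_7|8 - eta)), not part of the original post:
# Linear predictor per cheese (0 for the reference cheese A), then P(Y >= 8)
eta <- c(A=0, coef(Res.polr))
plogis(Res.polr$zeta['7|8'] - eta, lower.tail=FALSE) # ~0.176, 0.007, 0.037, 0.517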
clm
Clm comes from the ordinal package. As this package is dedicated to ordinal data, it is clearly a bit more advanced than polr. It offers, for instance, mixed models and multiplicative scale effects. These are things I won't use now, but would like to look at once I have panelist data; for now the clm function is enough. I am now able to make an overall statement about product differences (polr did not calculate the equivalent of model Res.clm0), although I might use 3 degrees of freedom rather than 1. There is also a confidence interval for the proportion of subjects selecting a category.
library(ordinal)
Res.clm <- clm(FResponse ~ Cheese, data=cheese2)
summary(Res.clm)
formula: FResponse ~ Cheese
data: cheese2
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 208 -355.67 733.35 6(0) 3.14e-11 8.8e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
CheeseB -3.3518 0.4287 -7.819 5.34e-15 ***
CheeseC -1.7099 0.3715 -4.603 4.16e-06 ***
CheeseD 1.6128 0.3805 4.238 2.25e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -5.46738 0.52363 -10.441
2|3 -4.41219 0.42776 -10.315
3|4 -3.31262 0.37004 -8.952
4|5 -2.24401 0.32674 -6.868
5|6 -0.90776 0.28335 -3.204
6|7 0.04425 0.26457 0.167
7|8 1.54592 0.30168 5.124
8|9 3.10577 0.40573 7.655
confint(Res.clm)
2.5 % 97.5 %
CheeseB -4.2134421 -2.5304102
CheeseC -2.4508720 -0.9919723
CheeseD 0.8793992 2.3746663
Res.clm0 <- clm(FResponse ~ .,data=cheese2)
anova(Res.clm0,Res.clm)
Likelihood ratio tests of cumulative link models:
formula: link: threshold:
Res.clm FResponse ~ Cheese logit flexible
Res.clm0 FResponse ~ . logit flexible
no.par AIC logLik LR.stat df Pr(>Chisq)
Res.clm 11 733.35 -3.5567e+02
Res.clm0 12 24.00 -7.8504e-06 711.35 1 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
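The 3-degrees-of-freedom version of the overall test mentioned above would compare the Cheese model against an intercept-only model rather than against Res.clm0. A minimal sketch (Res.clm1 is a name I introduce here, not part of the original script):
# Overall likelihood-ratio test of the Cheese effect on 3 df
Res.clm1 <- clm(FResponse ~ 1, data=cheese2)
anova(Res.clm1, Res.clm)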
predict(Res.clm,topred)
$fit
1 2 3 4 5 6
1 0.0042045475 0.007778711 0.023157362 0.06072674 0.1915909 0.22360309
2 0.1075968958 0.149647575 0.252560377 0.24192109 0.1684021 0.04745442
3 0.0228099720 0.040027964 0.104762084 0.20195678 0.3208734 0.16204462
4 0.0008409256 0.001570844 0.004795616 0.01349065 0.0537319 0.09799495
7 8 9
1 0.31326168 0.132806881 0.042870069
2 0.02500956 0.005841882 0.001566077
3 0.11040645 0.029081977 0.008036783
4 0.31086254 0.333236858 0.183475714
rowSums(predict(Res.clm,topred)$fit[,8:9])
1 2 3 4
0.175676950 0.007407959 0.037118760 0.516712572
1-predict(Res.clm,topred,type='cum.prob')$cprob2[,8]
0.175676950 0.007407959 0.037118760 0.516712572
topred2 <- data.frame(Cheese=levels(cheese2$Cheese),
    FResponse=ordered(8,levels=1:9))
predict(Res.clm,topred2,type='prob',interval=TRUE)
$fit
[1] 0.132806881 0.005841882 0.029081977 0.333236858
$lwr
[1] 0.078727094 0.002502026 0.014310133 0.230268677
$upr
[1] 0.21535183 0.01357929 0.05820190 0.45502986
VGAM
I like the VGAM package. It shows great promise for many situations where a vector of observations is measured; this model is but a simple thing for it. On the face of it, however, ordinal seems to have extensions that are more appropriate for my current problem. Also, the help says: 'This package is undergoing continual development and improvement. Until my monograph comes out and this package is released as version 1.0-0 the user should treat everything subject to change'. The data need to be entered a bit differently; the script is included below as a comment, in case anybody wants to see how it might work.
#library(VGAM)
#nobs <- nrow(cheese2)/4
#resD <- resC <- resB <- resA <- matrix(0,9,nobs)
#resA[cheese2$Response[cheese2$Cheese=='A']+(0:(nobs-1))*9] <- 1
#resB[cheese2$Response[cheese2$Cheese=='B']+(0:(nobs-1))*9] <- 1
#resC[cheese2$Response[cheese2$Cheese=='C']+(0:(nobs-1))*9] <- 1
#resD[cheese2$Response[cheese2$Cheese=='D']+(0:(nobs-1))*9] <- 1
#cheesevglm <- as.data.frame(t(cbind(resA,resB,resC,resD)))
#cheesevglm$Cheese <- factor(rep(levels(cheese2$Cheese),each=nobs))
#
#Res.vglm <- vglm(cbind(V1,V2,V3,V4,V5,V6,V7,V8,V9) ~ Cheese,cumulative,data=cheesevglm)
#summary(Res.vglm)
#predict(Res.vglm,topred,type='response')
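As an aside, vglm can also take the frequency table directly, which avoids the indicator-matrix construction above. A sketch, also left commented out and untested here ('wide' and 'Res.vglm2' are names I introduce); parallel=TRUE gives the proportional-odds version comparable to polr and clm:
#wide <- reshape(cheese[,c('Cheese','Response','N')], idvar='Cheese',
#    timevar='Response', direction='wide')
#Res.vglm2 <- vglm(as.matrix(wide[,-1]) ~ Cheese, cumulative(parallel=TRUE), data=wide)
#summary(Res.vglm2)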
MCMCoprobit
MCMCpack has Bayesian roots. It has many functions; ordinal data is but one of them. It does not rely on JAGS/WinBUGS/OpenBUGS. A difference between MCMCoprobit and the previous functions is the use of a probit rather than a logit link function. I don't think this should stop me from using it. The big disadvantage is that you get raw samples, so additional work is needed to get presentable results. Luckily, this makes it straightforward to calculate the proportion in the top 2 box I have been yapping about. A new question is how to interpret the wide interval for cheese C.
library(MCMCpack)
res.MCMCpack <- MCMCoprobit(Response ~ Cheese,data=cheese2)
summary(res.MCMCpack)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
(Intercept) 3.2423 0.16115 0.0016115 0.016761
CheeseB -1.9334 0.22376 0.0022376 0.019444
CheeseC -0.9782 0.20786 0.0020786 0.008635
CheeseD 0.9382 0.20754 0.0020754 0.005334
gamma2 0.6401 0.13234 0.0013234 0.064949
gamma3 1.3265 0.14265 0.0014265 0.072361
gamma4 1.9697 0.10030 0.0010030 0.036723
gamma5 2.7635 0.09922 0.0009922 0.035434
gamma6 3.3180 0.11913 0.0011913 0.051066
gamma7 4.1606 0.12626 0.0012626 0.057679
gamma8 4.9981 0.11750 0.0011750 0.049354
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
(Intercept) 2.9288 3.1358 3.2430 3.3497 3.5576
CheeseB -2.3606 -2.0869 -1.9372 -1.7850 -1.4926
CheeseC -1.3895 -1.1200 -0.9763 -0.8394 -0.5730
CheeseD 0.5378 0.7979 0.9380 1.0760 1.3434
gamma2 0.4428 0.5227 0.6221 0.7498 0.8878
gamma3 1.1034 1.2156 1.3013 1.4400 1.6292
gamma4 1.7679 1.9017 1.9779 2.0465 2.1424
gamma5 2.5821 2.6867 2.7606 2.8439 2.9328
gamma6 3.1207 3.2393 3.2912 3.3880 3.5808
gamma7 3.8694 4.0840 4.1776 4.2596 4.3487
gamma8 4.8306 4.9177 4.9788 5.0353 5.2884
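For a rough comparison with the logit-based fits above, probit-scale coefficients can be multiplied by roughly 1.7 (a standard rule of thumb, not something from the original post); a minimal sketch:
# Rule-of-thumb rescaling of the posterior means to the logit scale
1.7*summary(res.MCMCpack)$statistics[c('CheeseB','CheeseC','CheeseD'),'Mean']
coef(Res.polr) # polr's logit-scale estimates, for comparison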
Correction
The original analysis here was incorrect and has now been corrected. It now shows cheese D as most liked, rather than least liked as before. The error was obvious in hindsight: the first figure in this post (in green/red) already had cheese D as most liked, with approximately 50% of the ratings in the top two box. My apologies for this error.
top2box <- data.frame(probability = c(
    pnorm(res.MCMCpack[,'gamma7'] - res.MCMCpack[,1], lower.tail=FALSE),
    pnorm(res.MCMCpack[,'gamma7'] - (res.MCMCpack[,1] + res.MCMCpack[,'CheeseB']), lower.tail=FALSE),
    pnorm(res.MCMCpack[,'gamma7'] - (res.MCMCpack[,1] + res.MCMCpack[,'CheeseC']), lower.tail=FALSE),
    pnorm(res.MCMCpack[,'gamma7'] - (res.MCMCpack[,1] + res.MCMCpack[,'CheeseD']), lower.tail=FALSE)),
    Cheese=rep(levels(cheese2$Cheese), each=nrow(res.MCMCpack)))
library(lattice) # densityplot() comes from the lattice package
densityplot(~ probability | Cheese, data=top2box, xlab='Proportion Top 2 Box',
panel= function(x, ...) {
panel.densityplot(x,plot.points='rug' , ...)
} )
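To put numbers next to the density plot, the posterior samples can be summarised directly; a minimal sketch, not in the original post:
# Posterior median and 95% interval of the top-2-box proportion per cheese
aggregate(probability ~ Cheese, data=top2box,
    FUN=function(x) quantile(x, c(0.025, 0.5, 0.975)))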