Sunday, March 18, 2012

Liking of apples - more than juiciness

In a previous post it was shown, using literature data, that liking of apples is related to juiciness. However, some questions remained:

  1. Is the relation linear or slightly curved? 
  2. The variation in liking at a given CJuiciness is large. Are more explanatory variables needed?
  3. So, what drives CJuiciness?
In this post it becomes clear that indeed there is more to liking of apples than just juiciness. 
Error in average scores
The paper of Peneau et al. uses Tukey's post hoc test to examine the differences between the products, performed within each week. First we use the data to determine how large a difference must be before the test declares it significant.
library(xlsReadWrite)
library(plyr)
library(ggplot2)
library(locfit) 
#get data as before 

datain <- read.xls('condensed.xls')
#remove storage conditions
datain <- datain[-grep('bag|net',datain$Products,ignore.case=TRUE),]
#create week variable
datain$week <-  sapply(strsplit(as.character(datain$Products),'_'),
   function(x) x[[2]])
#function to extract pairwise numerical differences and significances 
extract.diff <- function(descriptor) {
  diffs1 <- ldply(c('W1','W2'),function(Week) {
        #numerical part of the scores, e.g. "4.1ab" -> 4.1
        value <- 
            as.numeric(gsub('[[:alpha:]]','',
               datain[,descriptor]))[datain$week==Week]
        #significance letters of the scores, e.g. "4.1ab" -> "ab"
        sig.dif <-    
            gsub('[[:digit:].]','',datain[,descriptor])[datain$week==Week]      
        #all pairwise differences between the product means
        dif.mat <- outer(value,value,'-')
        #1 if two products share a significance letter
        #(i.e. not significantly different), 0 otherwise
        sig.mat <- outer(sig.dif,sig.dif,function(X,Y) {
              sapply(1:length(X),function(i) {
                    g1 <- grep(paste('[',X[i],']',sep=''),Y[i])
                    as.numeric(length(g1)>0)
                  })
            })
        data.frame(dif.val = as.vector(dif.mat), dif.sig = 
                as.vector(sig.mat),Week=Week,descriptor=descriptor)
      })
  diffs1
}



likedif <- extract.diff("CLiking")
#keep only non-negative differences, so each pair occurs once
likedif <- likedif[likedif$dif.val>=0,]
g <- ggplot(likedif, aes(dif.val,dif.sig))
g + geom_jitter(aes(colour=Week),position=position_jitter(height=.05))

The plot shows that a difference between 0.24 and 0.26 is enough to be significant. For Juiciness, the pattern is the same:

likedif <- extract.diff("CJuiciness")
likedif <- likedif[likedif$dif.val>=0,]
g <- ggplot(likedif, aes(dif.val,dif.sig))
g + geom_jitter(aes(colour=Week),position=position_jitter(height=.05))
For Juiciness a difference of 0.24 is enough to be significant. All in all, a difference of 0.24 can be used as the error for both variables. 
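The thresholds can also be read from the data instead of off the plots. A minimal sketch, pooling both weeks (the helper threshold is introduced here, not part of the original analysis):
#the largest non-significant and smallest significant difference
#bracket the critical value; dif.sig == 1 means the two products
#share a letter, i.e. are not significantly different
threshold <- function(descriptor) {
  dd <- extract.diff(descriptor)
  dd <- dd[dd$dif.val>0,]
  c(max.nonsig = max(dd$dif.val[dd$dif.sig==1]),
    min.sig = min(dd$dif.val[dd$dif.sig==0]))
}
threshold('CLiking')
threshold('CJuiciness')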
Liking vs. Juiciness
The plot with errors is easy to make. For completeness a local fit is added. 
dataval <- datain
#strip the significance letters, keeping the numerical scores
vars <- names(dataval)[-1]
for (descriptor in vars) {
  dataval[,descriptor] <- as.numeric(gsub('[[:alpha:]]','',dataval[,descriptor]))
}
#local regression of liking on juiciness
l1 <- locfit(CLiking ~ lp(CJuiciness,nn=1),data=dataval)
topred <- data.frame(CJuiciness=seq(3.6,4.8,.1))
topred$CLiking <- predict(l1,topred)
g <- ggplot(dataval,aes(CJuiciness,CLiking))
g <- g  + geom_point() + geom_errorbar(aes(ymin=CLiking-.24,ymax=CLiking+.24))
g <- g + geom_errorbarh(aes(xmin=CJuiciness-.24,xmax=CJuiciness+.24))
g <- g + geom_line(data=topred,colour='blue')
g
Both the local fit and the errors suggest that curvature is worth pursuing. On top of that, a linear relation implies that any increase in juiciness is good, whereas in general an optimum level is expected. Compare with sugar: if you like two lumps of sugar, one is disliked as not sweet enough, while four is too sweet. Hence, again every reason for curvature. 
Regarding inclusion of extra variables, the data shows that the two products at CJuiciness 4.1 are almost significantly different in liking. Given that this significance comes from a Tukey HSD test, which is conservative, the point with much lower liking probably reflects a real effect. On the other hand, the data is fairly well described by curvature, so one extra explanatory variable should be enough.
Adding an extra explanatory variable
According to the calculations in the previous post, the two prime candidates for a second explanatory variable are CSweetness and CMealiness. In general one would expect juicy apples not to be mealy, so there is a reason to avoid CMealiness. Nevertheless both are investigated.
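Before fitting, the suspected collinearity can be quantified with a quick correlation check (a minimal sketch, not shown in the original post):
#correlations among the candidate predictors; a strong negative
#correlation between CJuiciness and CMealiness would confirm the
#suspicion above
cor(dataval[,c('CJuiciness','CSweetness','CMealiness')])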
#local fit with CSweetness as second explanatory variable
l1 <- locfit(CLiking ~ lp(CJuiciness,CSweetness,nn=1.3),data=dataval)
topred <- expand.grid(CJuiciness=seq(3.8,4.6,.1),CSweetness=seq(3.2,3.9,.1))
topred$CLiking <- predict(l1,topred)
v <- ggplot(topred, aes(CJuiciness, CSweetness, z = CLiking)) 
v <- v + stat_contour(aes(colour= ..level..) )
v + geom_point(data=dataval,aes(CJuiciness,CSweetness))
#same, with CMealiness as second explanatory variable
l1 <- locfit(CLiking ~ lp(CJuiciness,CMealiness,nn=1.3),data=dataval)
topred <- expand.grid(CJuiciness=seq(3.8,4.6,.1),CMealiness=seq(1.4,2.1,.1))
topred$CLiking <- predict(l1,topred)
v <- ggplot(topred, aes(CJuiciness, CMealiness, z = CLiking)) 
v <- v + stat_contour(aes(colour= ..level..) )
v + geom_point(data=dataval,aes(CJuiciness,CMealiness))
The link between CMealiness and CJuiciness is indeed quite strong. It is also clear that CMealiness does not explain the large difference in liking at CJuiciness 4.1. Hence CSweetness is chosen. Not the best of statistical reasons, but all in all it feels like the better model.
Simplified linear model
Finally, even though I like the local model, it is more convenient to use a simple linear model. After removal of non-significant terms, only three terms are left: CJuiciness, CJuiciness^2 and CSweetness.
l1 <- lm(CLiking ~ CJuiciness*CSweetness + I(CJuiciness^2) + I(CSweetness^2),data=dataval)
summary(l1)
Call:
lm(formula = CLiking ~ CJuiciness * CSweetness + I(CJuiciness^2) + 
    I(CSweetness^2), data = dataval)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.12451 -0.02432  0.01002  0.02591  0.06914 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)  
(Intercept)            -6.5326    19.8267  -0.329   0.7475  
CJuiciness              6.7662     4.0414   1.674   0.1199  
CSweetness             -2.5410     8.7807  -0.289   0.7772  
I(CJuiciness^2)        -0.8638     0.3564  -2.424   0.0321 *
I(CSweetness^2)         0.1264     0.9538   0.133   0.8968  
CJuiciness:CSweetness   0.3321     0.6574   0.505   0.6225  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.05827 on 12 degrees of freedom
Multiple R-squared: 0.9107, Adjusted R-squared: 0.8735 
F-statistic: 24.48 on 5 and 12 DF,  p-value: 6.589e-06 
l1 <- lm(CLiking ~ CJuiciness+CSweetness + I(CJuiciness^2) ,data=dataval)
summary(l1)
Call:
lm(formula = CLiking ~ CJuiciness + CSweetness + I(CJuiciness^2), 
    data = dataval)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.125337 -0.023834  0.004955  0.024922  0.087851 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)     -14.3609     5.1169  -2.807  0.01400 * 
CJuiciness        8.5997     2.3715   3.626  0.00275 **
CSweetness       -0.2683     0.1104  -2.430  0.02916 * 
I(CJuiciness^2)  -0.9428     0.2795  -3.373  0.00455 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.0547 on 14 degrees of freedom
Multiple R-squared: 0.9082, Adjusted R-squared: 0.8885 
F-statistic: 46.17 on 3 and 14 DF,  p-value: 1.654e-07 
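As an aside, the model reduction can also be automated. A minimal sketch (not part of the original post) using step(), which selects terms by AIC rather than by p-values, so the result may differ slightly:
#automated backward elimination from the full quadratic model
l0 <- lm(CLiking ~ CJuiciness*CSweetness + I(CJuiciness^2) +
        I(CSweetness^2),data=dataval)
step(l0)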
topred <- expand.grid(CJuiciness=seq(3.8,4.6,.05),CSweetness=seq(3.2,3.9,.05))
topred$CLiking <- predict(l1,topred)
v <- ggplot(topred, aes(CJuiciness, CSweetness, z = CLiking))
v <- v + stat_contour(aes(colour= ..level..) )
v + geom_point(data=dataval,stat='identity',position='identity',aes(CJuiciness,CSweetness))
The resulting plot shows quite some difference from the local fit, but mostly in regions without data. 
Hence it is concluded that liking of apples depends mainly on Juiciness, and somewhat on Sweetness. Above a certain Juiciness no further gain is to be made. Lower sweetness gives better liking.
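The reduced model makes "above a certain Juiciness" precise: the fitted quadratic reaches its maximum where its derivative with respect to CJuiciness is zero, i.e. at -b/(2a) with b the CJuiciness coefficient and a the CJuiciness^2 coefficient. A minimal sketch:
#optimum juiciness implied by the quadratic term
cf <- coef(l1)
-cf['CJuiciness']/(2*cf['I(CJuiciness^2)'])
With the coefficients above this gives approximately 4.56, at the upper edge of the observed juiciness range.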
Discussion
It is a bit disappointing that the error in CJuiciness and CSweetness is not incorporated in the model. Unfortunately, this is easier said than done. The keyword here is Total Least Squares, also known as Deming regression. These methods are only viable when exactly two variables are regressed, and curvature is also outside their scope. 
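For the simple two-variable case, Deming regression with equal error variances in both variables (orthogonal regression) even has a closed form. A minimal sketch, not part of the model above:
#closed-form Deming regression with error variance ratio delta = 1
deming <- function(x,y) {
  sxx <- var(x); syy <- var(y); sxy <- cov(x,y)
  slope <- (syy-sxx+sqrt((syy-sxx)^2+4*sxy^2))/(2*sxy)
  c(intercept = mean(y)-slope*mean(x), slope = slope)
}
deming(dataval$CJuiciness,dataval$CLiking)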
In addition, this leads to the question: what is error? The 'error' between the scores consists of different parts: differences between slices of apple, differences in sensory perception, and different ways of scoring. Regarding the model, differences between slices and in sensory perception are part of liking, while scoring error is not. Ideally these would be split, but that is a tall order. 
Hence, two questions remain:

  1. what drives CJuiciness?
  2. what drives CSweetness?
