Monday, April 30, 2012

Bayesian ANOVA for sensory panel profiling data

In this post I examine whether Bayesian methods, and specifically JAGS, can be used to analyze sensory profiling data. The aim is not to obtain different results, but rather to confirm that the results are fairly similar to a classical analysis. The data are the chocolate data from SensoMineR, and the script is adapted from various online sources collected over a longer period.

Building the model

JAGS (like WinBUGS and OpenBUGS) is a program which draws samples from posterior distributions. It is up to the user to provide the data and the model to JAGS and to interpret the samples. Luckily, R provides infrastructure which helps both in setting up the model and data and in the posterior analysis of the samples. Still, a number of steps have to be completed before the analysis can be executed: setting up the data, defining the model, initializing variables and deciding which parameters of the model are of interest. Only then can JAGS be called.

Setting up data

The data list contains everything which goes into JAGS: the experimental design and the measurements, but also derived quantities such as the number of rows in the data. For this post I have added some extra data, since I want to compare differences between product means; to compare those, I need samples from these specific distributions. All the data goes into one big list, which will be given to JAGS later on.
# get libraries
library(SensoMineR)
library(coda)
library(R2jags)

# infrastructure for contrasts
FullContrast <- function(n) {
  UseMethod('FullContrast',n)
}
FullContrast.default <- function(n) stop('FullContrast only takes integer and factors')
FullContrast.integer <- function(n) {
  mygrid <- expand.grid(v1=1:n,v2=1:n)
  mygrid <- mygrid[mygrid$v1<mygrid$v2,]
  rownames(mygrid) <- 1:nrow(mygrid)
  as.matrix(mygrid)
}
FullContrast.factor <- function(n) {
  FullContrast(nlevels(n))
}
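
# illustration (not needed for the analysis): for a 3-level factor, FullContrast
# returns all pairwise combinations of the level indices
# FullContrast(factor(c('a','b','c')))
#   v1 v2
# 1  1  2
# 2  1  3
# 3  2  3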

# reading data
data(chocolates)

# setting up experimental data
data_list <- with(sensochoc,list(Panelist = as.numeric(Panelist),
                              nPanelist = nlevels(Panelist),
                              Product = as.numeric(Product),
                              nProduct = nlevels(Product),
                              y=CocoaA,
                              N=nrow(sensochoc)))
# contrasts
data_list$Panelistcontr <- FullContrast(sensochoc$Panelist)
data_list$nPanelistcontr <- nrow(data_list$Panelistcontr)
data_list$Productcontr <-  FullContrast(sensochoc$Product) 
data_list$nProductcontr <- nrow(data_list$Productcontr)

Model definition

The model can be written in 'plain' R and then given to JAGS. However, JAGS does not have vector operations, hence there are a lot of for loops which would be unacceptable in normal R usage. Besides the additive effects in the first part of the model, there are quite a few extras. All the means in the model come out of hyperdistributions, which here are normally distributed. The precisions (and hence variances) of these hyperdistributions are estimated from the data. The final part of the model translates the internal parameters into quantities which are sensible to interpret. The model also needs to be written to a file so JAGS can use it later on; this is done in the last two lines.



mymodel <- function() {
  # core of the model  
  for (i in 1:N) {
    fit[i] <- grandmean +  mPanelist[Panelist[i]] + mProduct[Product[i]] + mPanelistProduct[Panelist[i],Product[i]]
    y[i] ~ dnorm(fit[i],tau)
  }
  # grand mean and residual 
  tau ~ dgamma(0.001,0.001)
  gsd <-  sqrt(1/tau)
  grandmean ~ dnorm(0,.001)
  # variable Panelist distribution  
  mPanelist[1] <- 0
  for (i in 2:nPanelist) {
    mPanelist[i] ~ dnorm(offsetPanelist,tauPanelist) 
  }
  offsetPanelist ~ dnorm(0,.001)
  tauPanelist ~ dgamma(0.001,0.001)
  sdPanelist <- sqrt(1/tauPanelist)
  # Product distribution 
  mProduct[1] <- 0
  for (i in 2:nProduct) {
    mProduct[i] ~ dnorm(offsetProduct,tauProduct)
  }
  offsetProduct ~ dnorm(0,0.001)
  tauProduct ~ dgamma(0.001,0.001)
  sdProduct <- sqrt( 1/tauProduct)
  # interaction distribution
  for (i in 1:nPanelist) {
    mPanelistProduct[i,1] <- 0
  }
  for (i in 2:nProduct) {
    mPanelistProduct[1,i] <- 0
  }
  for (iPa in 2:nPanelist) {
    for (iPr in 2:nProduct) {
      mPanelistProduct[iPa,iPr] ~dnorm(offsetPP,tauPP)
    }
  }
  offsetPP ~dnorm(0,0.001)
  tauPP ~dgamma(0.001,0.001)
  sdPP <- 1/sqrt(tauPP)
  # getting the interesting data
  # true means for Panelist
  for (i in 1:nPanelist) {
    meanPanelist[i] <- grandmean + mPanelist[i] + mean(mPanelistProduct[i,1:nProduct]) + mean(mProduct[1:nProduct])
  }
  # true means for Product
  for (i in 1:nProduct) {
    meanProduct[i] <- grandmean + mProduct[i] + mean(mPanelistProduct[1:nPanelist,i]) + mean(mPanelist[1:nPanelist])
  }
  for (i in 1:nPanelistcontr) {
    Panelistdiff[i] <- meanPanelist[Panelistcontr[i,1]]-meanPanelist[Panelistcontr[i,2]]
  }
  for (i in 1:nProductcontr) {
    Productdiff[i] <- meanProduct[Productcontr[i,1]]-meanProduct[Productcontr[i,2]]
  }
}

model.file <- file.path(tempdir(),'mijnmodel.txt')
write.model(mymodel,con=model.file)
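
To check what JAGS will actually read, the generated model file can be inspected with plain base R, for example:
cat(readLines(model.file), sep='\n')  # show the model as written for JAGS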

Initializing variables

Any values in the model which are not provided by the data need to be initialized. It is most convenient to set up a small function which generates these values.

inits <- function() list(
  grandmean = rnorm(1,3,1),
  mPanelist = c(0,rnorm(data_list$nPanelist-1)) ,
  mProduct = c(0,rnorm(data_list$nProduct-1)) ,
  mPanelistProduct = rbind(rep(0,data_list$nProduct),cbind(rep(0,data_list$nPanelist-1),matrix(rnorm((data_list$nPanelist-1)*(data_list$nProduct-1)),nrow=data_list$nPanelist-1,ncol=data_list$nProduct-1)))
,
  tau = runif(1,1,2),
  tauPanelist = runif(1,1,3),
  tauProduct = runif(1,1,3)
  )
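
# optional sanity check: draw one set of starting values and inspect the structure;
# it should contain grandmean, mPanelist, mProduct, an nPanelist by nProduct matrix
# and the three precisions
str(inits())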

Parameters of interest and calling JAGS

The parameters of interest are basically anything we want to know about. The JAGS call just lists all the parts prepared before. In this case the model runs fairly quickly, so I decided to use some extra iterations (n.iter) and an extra chain. For the moment I decided not to calculate DIC.
parameters <- c('sdPanelist','sdProduct','gsd','meanPanelist',
  'meanProduct','Productdiff','sdPP')
jagsfit <- jags(data=data_list,inits=inits,model.file=model.file,
   parameters.to.save=parameters,n.chains=4,DIC=FALSE,n.iter=10000)

Result

The first part of the result can be obtained via a simple print of jagsfit:
jagsfit

Inference for Bugs model at "/tmp/Rtmp0VFW9g/mijnmodel.txt", fit using jags,
 4 chains, each with 10000 iterations (first 5000 discarded), n.thin = 5
 n.sims = 4000 iterations saved
                 mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
Productdiff[1]     0.562   0.312 -0.037  0.356  0.558  0.773  1.186 1.001  4000
Productdiff[2]     2.303   0.317  1.697  2.087  2.299  2.512  2.932 1.001  4000
Productdiff[3]     1.741   0.314  1.118  1.530  1.748  1.948  2.346 1.001  3700
Productdiff[4]     0.836   0.307  0.224  0.630  0.838  1.044  1.424 1.002  2000
Productdiff[5]     0.274   0.303 -0.322  0.069  0.273  0.478  0.870 1.001  2600
Productdiff[6]    -1.467   0.313 -2.081 -1.671 -1.465 -1.257 -0.865 1.001  2700
Productdiff[7]     0.339   0.308 -0.263  0.130  0.338  0.537  0.951 1.001  4000
Productdiff[8]    -0.223   0.299 -0.807 -0.430 -0.218 -0.021  0.349 1.001  4000
Productdiff[9]    -1.965   0.319 -2.587 -2.183 -1.966 -1.757 -1.322 1.001  4000
Productdiff[10]   -0.497   0.294 -1.068 -0.699 -0.495 -0.300  0.082 1.002  1600
Productdiff[11]    0.740   0.317  0.130  0.525  0.733  0.952  1.374 1.001  4000
Productdiff[12]    0.177   0.300 -0.412 -0.019  0.178  0.382  0.753 1.001  4000
Productdiff[13]   -1.564   0.322 -2.186 -1.776 -1.570 -1.349 -0.913 1.001  3300
Productdiff[14]   -0.097   0.300 -0.686 -0.300 -0.090  0.102  0.505 1.002  2100
Productdiff[15]    0.401   0.300 -0.180  0.197  0.405  0.597  0.997 1.001  4000
gsd                1.689   0.067  1.563  1.644  1.687  1.734  1.821 1.002  1700
meanPanelist[1]    6.667   0.492  5.680  6.334  6.659  7.002  7.601 1.001  3100
meanPanelist[2]    5.513   0.436  4.660  5.219  5.512  5.808  6.369 1.001  4000
meanPanelist[3]    6.325   0.443  5.439  6.031  6.331  6.624  7.171 1.001  4000
meanPanelist[4]    7.394   0.443  6.542  7.094  7.400  7.703  8.262 1.002  1800
meanPanelist[5]    7.104   0.445  6.227  6.807  7.104  7.405  7.982 1.001  4000
meanPanelist[6]    6.201   0.436  5.328  5.912  6.208  6.497  7.034 1.001  4000
meanPanelist[7]    6.386   0.444  5.521  6.078  6.384  6.688  7.248 1.003  1100
meanPanelist[8]    6.451   0.443  5.562  6.160  6.452  6.760  7.301 1.001  4000
meanPanelist[9]    5.184   0.441  4.334  4.882  5.184  5.480  6.050 1.002  2300
meanPanelist[10]   8.110   0.460  7.212  7.815  8.107  8.411  9.020 1.001  3700
meanPanelist[11]   5.122   0.448  4.227  4.811  5.121  5.433  5.989 1.002  2100
meanPanelist[12]   7.191   0.441  6.327  6.891  7.181  7.492  8.061 1.001  2900
meanPanelist[13]   6.719   0.438  5.893  6.421  6.720  7.013  7.563 1.001  4000
meanPanelist[14]   6.266   0.440  5.396  5.978  6.266  6.553  7.138 1.002  1700
meanPanelist[15]   6.859   0.434  6.002  6.571  6.867  7.148  7.716 1.002  2200
meanPanelist[16]   6.079   0.434  5.238  5.783  6.082  6.371  6.930 1.002  2100
meanPanelist[17]   6.054   0.433  5.213  5.769  6.051  6.345  6.893 1.002  1800
meanPanelist[18]   6.122   0.420  5.281  5.844  6.129  6.398  6.939 1.002  2200
meanPanelist[19]   6.185   0.438  5.324  5.888  6.182  6.479  7.064 1.002  1500
meanPanelist[20]   4.988   0.456  4.088  4.680  4.997  5.292  5.873 1.001  4000
meanPanelist[21]   4.994   0.455  4.064  4.689  4.996  5.305  5.884 1.001  3400
meanPanelist[22]   5.062   0.451  4.177  4.763  5.065  5.370  5.944 1.001  2600
meanPanelist[23]   7.171   0.453  6.296  6.855  7.172  7.480  8.057 1.001  4000
meanPanelist[24]   5.663   0.442  4.799  5.368  5.671  5.959  6.532 1.002  2400
meanPanelist[25]   7.514   0.455  6.627  7.210  7.518  7.815  8.425 1.001  4000
meanPanelist[26]   6.320   0.439  5.432  6.023  6.325  6.624  7.156 1.001  2900
meanPanelist[27]   6.853   0.431  6.013  6.562  6.843  7.144  7.705 1.001  4000
meanPanelist[28]   7.045   0.440  6.197  6.741  7.043  7.340  7.926 1.001  4000
meanPanelist[29]   4.789   0.456  3.915  4.472  4.789  5.096  5.687 1.001  4000
meanProduct[1]     7.084   0.223  6.653  6.937  7.081  7.230  7.539 1.001  4000
meanProduct[2]     6.522   0.215  6.097  6.375  6.524  6.669  6.928 1.001  4000
meanProduct[3]     4.781   0.227  4.332  4.626  4.778  4.927  5.227 1.001  4000
meanProduct[4]     6.248   0.213  5.838  6.101  6.248  6.394  6.660 1.002  1500
meanProduct[5]     6.745   0.217  6.317  6.605  6.748  6.885  7.171 1.001  4000
meanProduct[6]     6.344   0.221  5.910  6.199  6.347  6.494  6.772 1.001  4000
sdPP               0.140   0.123  0.024  0.050  0.096  0.193  0.461 1.010   290
sdPanelist         0.996   0.178  0.700  0.869  0.978  1.106  1.393 1.001  4000
sdProduct          0.996   0.536  0.426  0.670  0.864  1.164  2.293 1.001  4000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
This is a big table, and the required information needs to be extracted from it. A good way is to plot the results. There are two plots to start with: a quick summary and a set of extensive plots. As the second plot command produces one figure per four variables, it is omitted here. In the figure it can be seen that some of the product differences are different from 0, which means these differences are believed to be present. From meanProduct it seems product 3 scores quite a bit lower than the other products. There is also quite some variation in meanPanelist; specifically, panelist 10 scores high, while panelists 9 and 11 score low.
The variables gsd and sdPanelist might be used to examine panel performance, but to do this properly they should be compared with results from other descriptors.
plot(jagsfit)
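
The posterior summaries of these variance components can also be pulled out of the fitted object directly. A minimal sketch, assuming the usual R2jags object layout in which the summary table sits in jagsfit$BUGSoutput$summary:
# sketch: extract the panel-performance related parameters from the fit
perf <- jagsfit$BUGSoutput$summary
perf[rownames(perf) %in% c('gsd','sdPanelist','sdProduct','sdPP'),
     c('mean','2.5%','97.5%')]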

Details about products

A main question is obviously: which products are different? For this we can extract some data from a summary of the samples.
jagsfit.mc <- as.mcmc(jagsfit)
# plot(jagsfit.mc) # this plot gives too many figures for the blog
fitsummary <- summary(jagsfit.mc)
# extract differences
Productdiff <- fitsummary$quantiles[ grep('Productdiff',rownames(fitsummary$quantiles)),]
# extract differences different from 0
data_list$Productcontr[Productdiff[,1]>0 | Productdiff[,5]<0,]
# get the product means
ProductMean <- fitsummary$statistics[ grep('meanProduct',rownames(fitsummary$quantiles)),]
rownames(ProductMean) <- levels(sensochoc$Product)
ProductMean
The result shows a table of product pairs which differ; most of these involve product 3, but product 1 also differs from products 4 and 6.

> jagsfit.mc <- as.mcmc(jagsfit)
> # plot(jagsfit.mc)
> fitsummary <- summary(jagsfit.mc)
> # extract differences
> Productdiff <- fitsummary$quantiles[ grep('Productdiff',rownames(fitsummary$quantiles)),]
> # extract differences different from 0
> data_list$Productcontr[Productdiff[,1]>0 | Productdiff[,5]<0,]
   v1 v2
2   1  3
3   2  3
4   1  4
6   3  4
9   3  5
11  1  6
13  3  6
> # get the product means
> ProductMean <- fitsummary$statistics[ grep('meanProduct',rownames(fitsummary$quantiles)),]
> rownames(ProductMean) <- levels(sensochoc$Product)
> ProductMean
          Mean        SD    Naive SE Time-series SE
choc1 7.083966 0.2225106 0.003518201    0.003233152
choc2 6.521746 0.2150476 0.003400201    0.003988800
choc3 4.780643 0.2272578 0.003593261    0.003553964
choc4 6.247723 0.2128971 0.003366198    0.003382249
choc5 6.745146 0.2165525 0.003423996    0.003230432
choc6 6.344260 0.2206372 0.003488580    0.003662731

Comparison with ANOVA

A few lines of R give the standard analysis. The product means are very close to the Bayesian ones. This ANOVA shows only differences involving product 3, probably because TukeyHSD can be a bit conservative, while the comparisons in the Bayesian model are unprotected.


> library(car)
> aa <- aov(CocoaA ~ Product * Panelist,data=sensochoc)
> Anova(aa)
Anova Table (Type II tests)

Response: CocoaA
                 Sum Sq  Df F value    Pr(>F)    
Product          207.54   5 12.6045 1.876e-10 ***
Panelist         390.43  28  4.2343 1.670e-09 ***
Product:Panelist 322.29 140  0.6991    0.9861    
Residuals        573.00 174                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> TukeyHSD(aa,'Product')
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = CocoaA ~ Product * Panelist, data = sensochoc)

$Product
                  diff        lwr        upr     p adj
choc2-choc1 -0.5344828 -1.5055716  0.4366061 0.6088175
choc3-choc1 -2.4137931 -3.3848819 -1.4427043 0.0000000
choc4-choc1 -0.8275862 -1.7986750  0.1435026 0.1433035
choc5-choc1 -0.2931034 -1.2641923  0.6779854 0.9531915
choc6-choc1 -0.7241379 -1.6952268  0.2469509 0.2674547
choc3-choc2 -1.8793103 -2.8503992 -0.9082215 0.0000014
choc4-choc2 -0.2931034 -1.2641923  0.6779854 0.9531915
choc5-choc2  0.2413793 -0.7297095  1.2124682 0.9797619
choc6-choc2 -0.1896552 -1.1607440  0.7814337 0.9932446
choc4-choc3  1.5862069  0.6151181  2.5572957 0.0000742
choc5-choc3  2.1206897  1.1496008  3.0917785 0.0000000
choc6-choc3  1.6896552  0.7185663  2.6607440 0.0000192
choc5-choc4  0.5344828 -0.4366061  1.5055716 0.6088175
choc6-choc4  0.1034483 -0.8676406  1.0745371 0.9996304
choc6-choc5 -0.4310345 -1.4021233  0.5400544 0.7960685

> model.tables(aa,type='mean',cterms='Product')
Tables of means
Grand mean
         
6.287356 

 Product 
Product
choc1 choc2 choc3 choc4 choc5 choc6 
7.086 6.552 4.672 6.259 6.793 6.362 

Conclusion

JAGS can be used to analyze sensory profiling data. However, when a simple model such as a two-way ANOVA suffices, it does not seem to be worth the trouble.

Monday, April 23, 2012

Tuning GAMBoost

This post describes some of the simulation results which I obtained with the GAMBoost package. The aim of these simulations is to get a feel for what I should and should not tune in GAMBoost.

Setup

The GAMBoost package has quite a number of parameters one can tune. I have looked at tuning bdeg (degree of the B-spline basis used for fitting smooth functions; values 1, 2 and 3) and penalty (penalty value for the update of an individual smooth function in each boosting step; values 10, 100 and 1000), while using cv.GAMBoost to determine the optimum number of boosting steps. As a second part, optimGAMBoostPenalty was used as a computationally less demanding alternative.


All data had 11 independent variables. 
The X matrix was filled with normally distributed random data. The number of objects in the training set was 25, 50, 100 or 200. The test set had 25 objects and was used to obtain the Root Mean Squared Error of Prediction (RMSEP).
Three data generating models have been used: linear, quadratic and complex.
y <- xmat[,1]*1 + rnorm(nrow)  # linear
y <- xmat[,1]*1 + xmat[,2]^2 + 5*sqrt(3.5+xmat[,3]) + rnorm(nrow)  # quadratic
y <- xmat[,1]*1 + xmat[,2]^2 + 5*sqrt(3.5+xmat[,3]) + 6*sin(3*xmat[,4] + 4*cos(6*xmat[,5])) + rnorm(nrow)  # complex
Each simulation was run 3 times, to get a feel for the variation in the results.

Key Learning

  • cv.GAMBoost will provide a reasonable model, independent of the penalty or bdeg used.
  • Calculation time for cross validation is quite large and easily becomes prohibitive.
  • optimGAMBoostPenalty gives a faster result, but also a larger RMSEP.
  • Overly complex relations are not easily captured.
  • GAMBoost needs quite a large number of training objects to perform well.

Detailed Results

  • 25 training objects was clearly not enough; 50 training objects was only just not enough.
    • When the training set contains 25 objects, the variation in RMSEP between the three repeats was quite large. This suggests that the result depended on the realization of the random x matrix. This effect is much smaller with a 50 object training set.
  • 100 to 200 training objects was enough for the linearly generated response.
    • With 100 and 200 objects in the training set the linearly generated response resulted in an RMSEP of 1 to 1.3, which is close to what is expected.
  • 100 to 200 training objects was enough for the quadratically generated response, but the RMSEP was higher than for the linearly generated response.
    • With 100 to 200 objects the RMSEP was 1.2 to 1.6.
    • With 25 to 100 objects in the training set the RMSEP was 1.8 to 2.2.
  • The complex model was never predicted well.
    • RMSEP of 4 to 6.
    • The RMSEP was slightly lower for 200 training objects than for 50 training objects, but not impressively so. Even 200 training objects were not enough.
  • Penalty and bdeg are not overly important.
    • A lower penalty resulted in fewer boosting steps, but the RMSEP was not systematically influenced: cv.GAMBoost compensates for a lower penalty with fewer steps.
    • Out of the 9 penalty/bdeg combinations, the model with the best AIC did not give the best RMSEP.
  • Calculation time for 200 objects was just feasible. 400 objects would need a lot of patience or better (newer) hardware than I am using.
    • Parallel execution works for cv.GAMBoost. If only I had more than two cores in my computer.
  • optimGAMBoostPenalty gave the same or a higher RMSEP than cv.GAMBoost.
    • In the cases where cv.GAMBoost gave an RMSEP clearly over 1.5, optimGAMBoostPenalty performed comparably.
    • Even in the data-rich situation (linearly generated response, 200 training objects), cv.GAMBoost had a lower RMSEP than optimGAMBoostPenalty.

R code

The code is a bit rough, especially regarding its output.
library(GAMBoost)
library(snow)
library(snowfall)
sfInit(parallel=TRUE,cpus=2)
rmse <- function(a,b) sqrt(mean((a-b)^2))

lapply(1:3, function(outer) {
ntrain <- 25
ntest <- 50
nrow <- ntrain + ntest
ncol=11

xmat <- matrix(rnorm(nrow*ncol),nrow=nrow,ncol=ncol)
colnames(xmat) <- paste('V', 1:ncol,sep='')
yvec <- xmat[,1]*1 + rnorm(nrow)+ xmat[,2]^2 + 5*sqrt(3.5+xmat[,3])+ 6*sin(3*xmat[,4] + 4*cos(6*xmat[,5]))

alldf <- as.data.frame(cbind(yvec,xmat))

xtrain <- xmat[1:ntrain,]
ytrain <- yvec[1:ntrain]
dftrain <- alldf[1:ntrain,]
xtest <- xmat[(ntrain+1):(ntrain+ntest),]
ytest <- yvec[(ntrain+1):(ntrain+ntest)]
dftest <- alldf[(ntrain+1):(ntrain+ntest),]
mysettings <- expand.grid(penalty=c(10,100,1000),maxstepno=1000,bdeg=c(1,2,3))
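# fit cv.GAMBoost for each penalty/bdeg combination in mysettings;
# each element of myres is then the cross-validated fit for one row of the grid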
myres <- lapply(1:nrow(mysettings),function(x) {
   cvgm <- cv.GAMBoost(x=xtrain,y=ytrain,family=gaussian(),parallel=TRUE,
                       penalty=mysettings$penalty[x],
                       maxstepno=mysettings$maxstepno[x],
                       bdeg=mysettings$bdeg[x])
} )
mysettings$which.min.aic <- sapply(myres,function(x) {
  which.min(x$AIC)
})
mysettings$min.aic <- sapply(myres,function(x) {
  min(x$AIC)
})

mysettings$rmsep <- sapply(myres,function(x) {
  ypredgm <- predict(x,xtest)
  rmse(ypredgm,ytest)
})

  op <- optimGAMBoostPenalty(x=xtrain,y=ytrain,family=gaussian(),maxstepno=1000,parallel=TRUE,trace=FALSE)
  ypredop <- predict(op,xtest)
  
  list(mysettings=mysettings,c(optpenalty=op$penalty,oprmse=rmse(ypredop,ytest)) )
                    
} )






Sunday, April 8, 2012

Predicting apple liking from instrumental data

In this post I examine whether the individual models from previous posts together make a good predictive model. Obviously, predictive capability is different from descriptive capability; predictions usually call for a simpler model. As there is no test set, the predictive capability is analyzed using cross-validation. It should be noted that one of the apples had no instrumental data, so this sample cannot be used for the model.
It is found that the combination of models has a prediction error slightly worse than a simple PLS model using the physico-chemical data. This is not so bad. On the one hand, PLS is very much a predictive method, so it should perform better. On the other hand, PLS does not take curvature into account, which gives the PLS model a handicap for some of the details. These two effects seem to balance out. Finally, the separate models can be interpreted more easily than a PLS model.

Model fit

The model uses all the separate parts which were defined previously in this blog. For ease of use these are all placed in a small function. A second function is used to generate predictions. The first figure shows the predictions using all data, with the ideal line (y=x) added.
The normal predictions are not very impressive. In particular, the least liked products are not predicted as least liked at all. Part of the problem may be in an intermediate step: CJuiciness, the juiciness observed by the consumers, is not well predicted. As it was found earlier that CJuiciness is the main explanatory factor for liking, this may explain part of the lack of predictive power.

Model prediction capability using cross validation

Cross validation predictions are given in the second figure. The sub-figures make very clear that the model does not do a good job of predicting new data. Again, juiciness is a big part of the problem, now both the sensory juiciness and the consumer-perceived juiciness.

Prediction error

For completeness, the prediction error is calculated. The Root Mean Squared Error of Prediction is 0.170. As a comparison, a PLS model predicting liking from the physico-chemical data has a slightly better prediction error. This makes the combined model at least a competitor to PLS. It also shows that predicting liking is a difficult problem.
Data: X dimension: 17 4 
Y dimension: 17 1
Fit method: kernelpls
Number of components considered: 4

VALIDATION: RMSEP
Cross-validated using 17 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps
CV           0.174   0.1626   0.1617   0.2029   0.2007
adjCV        0.174   0.1615   0.1610   0.2005   0.1984

TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps
X          34.74    80.63    87.17   100.00
CLiking    33.80    36.71    40.89    40.96

R code

library(xlsReadWrite)
library(gam)
library(plyr)
datain <- read.xls('condensed.xls')
datain <- datain[-grep('bag|net',datain$Products,ignore.case=TRUE),]
datain$week <-  sapply(strsplit(as.character(datain$Product),'_'),
    function(x) x[[2]])
dataval <- datain
vars <- names(dataval)[-1]
for (descriptor in vars) {
  dataval[,descriptor] <- as.numeric(gsub('[[:alpha:]]','',
          dataval[,descriptor]))
}

# all fits in a model
FitCombinedModel <- function(dataval) {
# link instrument to sensory
  MJuicsSEL3 <- gam(SJuiciness ~ AInstrumental.firmness + s(AJuice.release,3) + 
          s(ATitratable.acidity,3) ,data=dataval)
  MSweetSEL <- gam(SSweetness ~ AInstrumental.firmness + s(AJuice.release,3) + 
          ASoluble.solids + ATitratable.acidity  ,data=dataval)
  MCrispSEL <- gam(SCrispness ~ s(AInstrumental.firmness,3) + AJuice.release + 
          ASoluble.solids + ATitratable.acidity ,data=dataval)
  MMealiSEL <- gam(SMealiness ~ s(AInstrumental.firmness,3) + AJuice.release + 
          ASoluble.solids + ATitratable.acidity  ,data=dataval)
# link sensory to consumer experience
  MCJuic <- lm(CJuiciness ~ SJuiciness + SSweetness,data=dataval)
  MCSweet <- lm(CSweetness ~ SSweetness + SCrispness ,data=dataval)
# link consumer experience to liking
  Mliking <- lm(CLiking ~ CJuiciness+CSweetness + I(CJuiciness^2) ,data=dataval)
 return(list(MJuicsSEL3=MJuicsSEL3,MSweetSEL=MSweetSEL,MCrispSEL=MCrispSEL,
    MMealiSEL=MMealiSEL,MCJuic=MCJuic,MCSweet=MCSweet,Mliking=Mliking ))
}
PredictCombinedModel <- function(modellist,newdata) {
  preddata <- data.frame(SJuiciness = predict(modellist$MJuicsSEL3,newdata),
      SSweetness =  predict(modellist$MSweetSEL,newdata),
      SCrispness = predict(modellist$MCrispSEL,newdata),
      SMealiness = predict(modellist$MMealiSEL,newdata))
  preddata$CJuiciness <- predict(modellist$MCJuic,preddata)
  preddata$CSweetness <- predict(modellist$MCSweet,preddata)
  preddata$CLiking <- predict(modellist$Mliking,preddata)
  preddata
}

FCM <- FitCombinedModel(dataval) 
datain.nomiss <- dataval[-1,]  # the first apple has no instrumental data, so it is dropped

predictions <- PredictCombinedModel(FCM,datain.nomiss)
par(mfrow=c(2,3))
m_ply(c('SJuiciness','SSweetness','SCrispness','CJuiciness',
        'CSweetness','CLiking'),function(x) {
   plot(y=predictions[,x],x=datain.nomiss[,x],ylab=paste('predicted',x),
      xlab=paste('observed',x))
   abline(a=0,b=1)
   })

#cross validation section
lopred <- mdply(1:nrow(datain.nomiss),function(x) {
      datain.nomisslo <- datain.nomiss[-x,]
      lomodel <- FitCombinedModel(datain.nomisslo)
      PredictCombinedModel(lomodel,datain.nomiss[x,])
    })
#lopred <- do.call(rbind,lopred)
par(mfrow=c(2,3))
m_ply(c('SJuiciness','SSweetness','SCrispness','CJuiciness',
        'CSweetness','CLiking'),function(x) {
   plot(y=lopred[,x],x=datain.nomiss[,x],
     ylab=paste('cv predicted',x),xlab=paste('observed',x))
   abline(a=0,b=1)
})
# root mean squared error of prediction
sqrt(mean((lopred$CLiking-datain.nomiss$CLiking)^2))
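
# a possible extra check (not in the original code): cross-validated RMSEP per
# response, to see which intermediate predictions contribute most to the error
sapply(c('SJuiciness','SSweetness','SCrispness','CJuiciness','CSweetness','CLiking'),
    function(x) sqrt(mean((lopred[,x]-datain.nomiss[,x])^2)))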

#pls as benchmark
library(pls)
plsmodel <- plsr(CLiking ~ AInstrumental.firmness + AJuice.release + ASoluble.solids + ATitratable.acidity,data=datain.nomiss,
    scale=TRUE,validation='LOO' )
summary(plsmodel)

Monday, April 2, 2012

Linking apples liking to analytical data

This post describes the last puzzle piece of the model: the link from instrumental to sensory data. Together with the previous pieces this gives a model running from physico-chemical measurements, via sensory data and consumers' perception, to liking.
In this post I chose to use generalized additive models (GAM). The aim is to build models for SJuiciness, SSweetness, SCrispness and SMealiness. The latter is included because the previous post suggested it might be of interest, so I want to keep it in the picture.
GAM appears to be a relatively simple method to apply; however, even with four variables the 17 rows of data are too few to allow an extensive model, meaning that I had to restrict the smoothers' degrees of freedom.

Data preparation and SJuiciness

library(xlsReadWrite)
library(ggplot2)
library(gam)
datain <- read.xls('condensed.xls')
#remove storage conditions
datain <- datain[-grep('bag|net',datain$Products,ignore.case=TRUE),]
datain$week <-  sapply(strsplit(as.character(datain$Product),'_'),
    function(x) x[[2]])
dataval <- datain
vars <- names(dataval)[-1]
for (descriptor in vars) {
  dataval[,descriptor] <- as.numeric(gsub('[[:alpha:]]','',
          dataval[,descriptor]))
}
#SJuiciness
indepV <- grep('^A',vars,value=TRUE)
MJuicL <- gam(SJuiciness ~ AInstrumental.firmness + AJuice.release + 
        ASoluble.solids + ATitratable.acidity ,data=dataval)
MJuics2 <- gam(SJuiciness ~ s(AInstrumental.firmness,2) + s(AJuice.release,2) + 
        s(ASoluble.solids,2) + s(ATitratable.acidity,2) ,data=dataval)
MJuics3 <- gam(SJuiciness ~ s(AInstrumental.firmness,3) + s(AJuice.release,3) + 
        s(ASoluble.solids,3) + s(ATitratable.acidity,3) ,data=dataval)
summary(MJuicL)
Call: gam(formula = SJuiciness ~ AInstrumental.firmness + AJuice.release + 
    ASoluble.solids + ATitratable.acidity, data = dataval)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-5.9405 -2.5110 -0.6978  3.1524  7.9597 

(Dispersion Parameter for gaussian family taken to be 18.5651)

    Null Deviance: 1002.575 on 16 degrees of freedom
Residual Deviance: 222.7816 on 12 degrees of freedom
AIC: 103.9845 
1 observation deleted due to missingness 

Number of Local Scoring Iterations: 2 
DF for Terms


                       Df
(Intercept)             1
AInstrumental.firmness  1
AJuice.release          1
ASoluble.solids         1
ATitratable.acidity     1

summary(MJuics2)
Call: gam(formula = SJuiciness ~ s(AInstrumental.firmness, 2) + s(AJuice.release, 
    2) + s(ASoluble.solids, 2) + s(ATitratable.acidity, 2), data = dataval)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-5.3781 -2.1730 -0.8181  1.9755  6.2297 

(Dispersion Parameter for gaussian family taken to be 19.4993)

    Null Deviance: 1002.575 on 16 degrees of freedom
Residual Deviance: 155.997 on 8.0002 degrees of freedom
AIC: 105.9262 
1 observation deleted due to missingness 

Number of Local Scoring Iterations: 2 

DF for Terms and F-values for Nonparametric Effects

                             Df Npar Df  Npar F  Pr(F)
(Intercept)                   1                       
s(AInstrumental.firmness, 2)  1       1 0.77410 0.4046
s(AJuice.release, 2)          1       1 1.14455 0.3159
s(ASoluble.solids, 2)         1       1 0.40054 0.5444
s(ATitratable.acidity, 2)     1       1 1.04256 0.3371

summary(MJuics3)

Call: gam(formula = SJuiciness ~ s(AInstrumental.firmness, 3) + s(AJuice.release, 
    3) + s(ASoluble.solids, 3) + s(ATitratable.acidity, 3), data = dataval)
Deviance Residuals:
       2        3        4        5        6        7        8        9 
 4.06906 -0.28794 -1.14197 -0.20378 -1.77053  0.68596  2.67061 -0.38020 
      12       13       14       15       16       17       18       19 
-4.06569  3.64093 -1.10219 -0.08048  1.03191 -1.13588  2.64431 -3.02135 
      20 
-1.55275 

(Dispersion Parameter for gaussian family taken to be 20.1907)

    Null Deviance: 1002.575 on 16 degrees of freedom
Residual Deviance: 80.7624 on 4 degrees of freedom
AIC: 102.735 
1 observation deleted due to missingness 

Number of Local Scoring Iterations: 4 

DF for Terms and F-values for Nonparametric Effects

                             Df Npar Df  Npar F  Pr(F)
(Intercept)                   1                       
s(AInstrumental.firmness, 3)  1       2 0.78051 0.5174
s(AJuice.release, 3)          1       2 1.04430 0.4316
s(ASoluble.solids, 3)         1       2 0.31434 0.7468
s(ATitratable.acidity, 3)     1       2 1.25657 0.3772

anova(MJuics3,MJuics2,MJuicL)
Analysis of Deviance Table

Model 1: SJuiciness ~ s(AInstrumental.firmness, 3) + s(AJuice.release, 
    3) + s(ASoluble.solids, 3) + s(ATitratable.acidity, 3)
Model 2: SJuiciness ~ s(AInstrumental.firmness, 2) + s(AJuice.release, 
    2) + s(ASoluble.solids, 2) + s(ATitratable.acidity, 2)
Model 3: SJuiciness ~ AInstrumental.firmness + AJuice.release + ASoluble.solids + 
    ATitratable.acidity
  Resid. Df Resid. Dev      Df Deviance Pr(>Chi)
1    4.0000     80.762                          
2    8.0002    155.997 -4.0002  -75.235   0.4444
3   12.0000    222.782 -3.9998  -66.785   0.5077
par(mfrow=c(2,2))
plot(MJuics2,se=TRUE)

plot(MJuics3,se=TRUE)
From these plots and tables it is concluded that ASoluble.solids is not needed, that AJuice.release and ATitratable.acidity may have non-linear effects, and that AInstrumental.firmness has a linear effect. Based on this, two models are defined with different amounts of curvature.
MJuicsSEL2 <- gam(SJuiciness ~ AInstrumental.firmness + s(AJuice.release,2) + 
         s(ATitratable.acidity,2) ,data=dataval)
MJuicsSEL3 <- gam(SJuiciness ~ AInstrumental.firmness + s(AJuice.release,3) + 
         s(ATitratable.acidity,3) ,data=dataval)
anova(MJuicL,MJuicsSEL2,MJuicsSEL3) 
Analysis of Deviance Table

Model 1: SJuiciness ~ AInstrumental.firmness + AJuice.release + ASoluble.solids + 
    ATitratable.acidity
Model 2: SJuiciness ~ AInstrumental.firmness + s(AJuice.release, 2) + 
    s(ATitratable.acidity, 2)
Model 3: SJuiciness ~ AInstrumental.firmness + s(AJuice.release, 3) + 
    s(ATitratable.acidity, 3)
  Resid. Df Resid. Dev      Df Deviance Pr(>Chi)  
1        12     222.78                            
2        11     177.48 0.99998   45.302  0.05808 .
3         9     113.53 2.00006   63.953  0.07927 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
From this it is concluded that the model MJuicsSEL3 is the best model to predict SJuiciness.

SSweetness, SCrispness and SMealiness

For brevity, I have chosen not to show all the results for the other three responses; only the plots of the final models are shown (a code sketch follows below).
SSweetness is mainly related to AJuice.release (3 df smoother), with smaller linear effects of the other three variables.

SCrispness is non-linearly related to AInstrumental.firmness (3 df smoother), with much less influential linear effects of the other variables.
SMealiness again has a 3 df smoother on AInstrumental.firmness, with much less influential effects of the other variables.
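For reference, a sketch of what these final models look like in code, based on the descriptions above (the same formulas reappear in the combined liking model of a later post):
MSweetSEL <- gam(SSweetness ~ AInstrumental.firmness + s(AJuice.release,3) + 
        ASoluble.solids + ATitratable.acidity ,data=dataval)
MCrispSEL <- gam(SCrispness ~ s(AInstrumental.firmness,3) + AJuice.release + 
        ASoluble.solids + ATitratable.acidity ,data=dataval)
MMealiSEL <- gam(SMealiness ~ s(AInstrumental.firmness,3) + AJuice.release + 
        ASoluble.solids + ATitratable.acidity ,data=dataval)
par(mfrow=c(2,2))
plot(MCrispSEL,se=TRUE)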

Discussion

GAM

GAM is a suitable method for examining models where no interactions are expected but non-linear effects are. It is simple to use, and it shows which of the variables have the larger effects and what these effects approximately look like. However, it does have its problems. For instance, there is an artifact in AInstrumental.firmness around 60, where there is a small bump in the responses. If this is a representation of physical reality, then the rest of the curves seem fairly smooth. It would also indicate that the amount of data (types of apples) needs to be much larger to get a good understanding of the relations between the variables. A second problem is the absence of any interactions; in many data sets, interactions between variables are needed.

Overall model for liking

The various parts of the model have now been defined. However, these separate parts do not yet make a whole model. They need to be assembled and subsequently compared with a more traditional model such as (multiblock) PLS.