Tuesday, May 22, 2012

A complete Bayesian model for sensory profiling data

In this post I will try to add an important parts in the sensory profiling model I have been building. This concerns the question: 'Are all panelists equally reproducible?'. Obviously the answer is no, some are better than others. From this observation stems the approach in which under performing panelists are removed prior to the final analysis. The beauty of the Bayesian model is that this removal is not needed. The model can weigh down the panelists based on performance.
Finally, it could be chosen to introduce the concept that the data is not normal. An error distribution with fatter tails may give more robust results. For the current data, any look at the data will reveal that it is not normal. Panelists only use the numbers 0 to 10, suggesting a ordered logistic model, which is for another time. A continuous line scale with values 0 to 100 is used most often in sensory profiling. For a line scale using a t distribution rather than normal distribution can be appropriate. The beauty is, in the Bayesian model the data can be used to determine the degrees of freedom for the t distribution. To allow general application of the model I will leave the error normal though.

Modelling panelists' individual error

The model statements needed for this model part are taken from Data analysis using regression and multilevel/hierarchical models' by Gelman and Hill (2007).
First the error must be made panelist dependent:

y[i] ~ dnorm(fit[i],tau)

becomes

y[i] ~ dnorm(fit[i],tauPanelist[Panelist[i]])
Obviously tauPanelist needs its prior distribution, for which we need non informative hyperpriors. 
for (i in 1:nPanelist) {
         tauPanelist[i] <-  pow(sdPanelist[i],-2)
sdPanelist[i] ~ dlnorm(mu.lsdP,tau.lsdP)
        }
mu.lsdP ~ dnorm(0,.0001)
tau.lsdP <- pow(sigma.lsdP,-2)
sigma.lsdP ~ dunif(0,100)

Our interest is in sdPanelist[ ], which will be examined in the result phase.
A consequence of adding this model part is the need for more precise variable names. There was a variable sdPanelist before, which concerned the standard deviation between panelists' means. This has been renamed to sdPanel.

Result

Results consist of three parts, statements about the model, the panel and statements about the product. 

Model results

This section should contain the output of the fit. However, by now it is 128 lines of output and a stack of figures. Who wants that in a blog? For briefness a few remarks. I was very afraid that the model would misbehave and give correlation between panelists error and panelists scale usage values. This did not happen in any obvious way, for which I am very happy. 
What did happen was a lack of mixing for the term sdSessionPanelist. If this happens more often action is needed. As it is, this is a parameter of minor interest, so I let it be. However, the temptation is there to either attempt to reformulate the model or increase the number of samples.

Panel results

To start with the panel, we can draw three figures, location, scale and error.
Panelists' means reveal that panelist 10 scores a bit higher than the other panelists. Panelist 25 also has a bit high value and 21 a bit low value. 
Panelists' scales reveal that panelists 13 and 23 use a bit wider scale than the other panelists.  
Finally we have the figure for panelists' error. Panelist 29, 7 and 20 are less well performing than the other panelists. 
All things considered, I would focus on the most important parts of the panel performance. The few panelists with high standard error, such as panelist 29. After that I would look at location. Scale usage seems to be much more in order.
However, there are 14 descriptors in the data, so similar plots can be made for the other descriptors. So, based on those results I would reassess my statements about training. Hence we need all the analysis. While fitting all this in a loop is not much of a problem, this does not seem satisfactory. Panels run as a matter of standard on daily or weekly basis. How about a report for each product set, in full, within a day? This way panel feedback can be quick and efficient. To get there, it would be nice to make the analysis in a standard report. This seems to beg for usage of sweave, odfweave or knitr. Especially knitr seems to be the product to use, considering the recent postings on www.r-bloggers.com.

Product results

The added model terms have increased the number of product differences and decreased standard deviations. This is not so strange, the less performing panelists are weighed down. In this respect, it seems the model is performing admirably.
   v1 v2

2   1  3
3   2  3
4   1  4
6   3  4
9   3  5
11  1  6
13  3  6



          Mean        SD    Naive SE Time-series SE
choc1 6.987792 0.1833657 0.002899266    0.002886758
choc2 6.597215 0.1796334 0.002840253    0.003067686
choc3 4.842514 0.1975550 0.003123619    0.003345383
choc4 6.363283 0.1787572 0.002826399    0.002985240
choc5 6.665101 0.1833072 0.002898341    0.003356131
choc6 6.486780 0.1795052 0.002838226    0.002685428

R code

library(SensoMineR)
library(coda)
library(R2jags)

FullContrast <- function(n) {
UseMethod('FullContrast',n)
}
FullContrast.default <- function(n) stop('FullContrast only takes integer and factors')
FullContrast.integer <- function(n) {
mygrid <- expand.grid(v1=1:n,v2=1:n)
mygrid <- mygrid[mygrid$v1<mygrid$v2,]
rownames(mygrid) <- 1:nrow(mygrid)
as.matrix(mygrid)
}
FullContrast.factor <- function(n) {
FullContrast(nlevels(n))
}

data(chocolates)

data_list <- with(sensochoc,list(Panelist = as.numeric(Panelist),
nPanelist = nlevels(Panelist),
Product = as.numeric(Product),
nProduct = nlevels(Product),
Round = Rank,
nRound = length(unique(Rank)),
Session = Session,
nSession = length(unique(Session)),
y=CocoaA,
N=nrow(sensochoc)))
data_list$Productcontr <-  FullContrast(sensochoc$Product) 
data_list$nProductcontr <- nrow(data_list$Productcontr)


model.file <- file.path(tempdir(),'mijnmodel.txt')

mymodel <- function() {
# core of the model  
for (i in 1:N) {
fit[i] <-  grandmean +  mPanelist[Panelist[i]] +  mSessionPanelist[Session[i],Panelist[i]] + 
(mProduct[Product[i]] + mRound[Round[i]]) * sPanelist[Panelist[i]]
y[i] ~ dnorm(fit[i],tauPanelist[Panelist[i]])
}
 
for (i in 1:nPanelist) {
    tauPanelist[i] <-  pow(sdPanelist[i],-2)
sdPanelist[i] ~ dlnorm(mu.lsdP,tau.lsdP)
    }
mu.lsdP ~ dnorm(0,.0001)
tau.lsdP <- pow(sigma.lsdP,-2)
sigma.lsdP ~ dunif(0,100)
grandmean ~ dnorm(0,.001)
# variable Panelist distribution  
for (i in 1:nPanelist) {
mPanelist[i] ~ dnorm(0,tauPanelMean) 
}
tauPanelMean ~ dgamma(0.001,0.001)
sdPanelMean <- sqrt(1/tauPanelMean)
# Product distribution 
for (i in 1:nProduct) {
mProduct[i] ~ dnorm(0,tauProduct)
}
tauProduct ~ dgamma(0.001,0.001)
sdProduct <- sqrt( 1/tauProduct)
for (i in 1:nRound) {
mRound[i] ~ dnorm(0,tauRound) 
}
tauRound <- 1/(sdRound^2)
sdRound ~ dnorm(0,4) %_% T(0,)
for (i in 1:nSession) {
for (j in 1:nPanelist) {
  mSessionPanelist[i,j] ~ dnorm(0,tauSessionPanelist)
   }
}
tauSessionPanelist <- 1/(sdSessionPanelist^2)
sdSessionPanelist ~ dnorm(0,4) %_% T(0,)
#
# distribution of the multiplicative effect
for (i in 1:nPanelist) {
sPanelist[i] <- exp(EsPanelist[i])
EsPanelist[i] ~ dnorm(0,9)  
}
# getting the interesting data
# true means for Panelist
for (i in 1:nPanelist) {
meanPanelist[i] <-  grandmean + mPanelist[i] +  
mean(mSessionPanelist[1:nSession,i]) + 
      ( mean(mProduct[1:nProduct]) + mean(mRound[1:nRound]))*sPanelist[i]
}
# true means for Product
for (i in 1:nProduct) {
meanProduct[i] <- grandmean + 
mean(mPanelist[1:nPanelist]) + 
mean(mSessionPanelist[1:nSession,1:nPanelist]) +
(mProduct[i] + mean(mRound[1:nRound]) )*exp(mean(EsPanelist[1:nPanelist]))  
}
for (i in 1:nProductcontr) {
Productdiff[i] <- meanProduct[Productcontr[i,1]]-meanProduct[Productcontr[i,2]]
}
}

write.model(mymodel,con=model.file)

inits <- function() list(
grandmean = rnorm(1,3,1),
mPanelist = c(0,rnorm(data_list$nPanelist-1)) ,
mProduct = c(0,rnorm(data_list$nProduct-1)) ,
EsPanelist = rnorm(data_list$nPanelist) ,
tau = runif(1,1,2),
tauPanel = runif(1,1,3),
tauProduct = runif(1,1,3),
mRound = rnorm(data_list$nRound),
sdRound = abs(rnorm(1)),
mSessionPanelist= matrix(rnorm(data_list$nPanelist*data_list$nSession),nrow=data_list$nSession),
sdSessionPanelist = abs(rnorm(1)),
sdPanelist=runif(data_list$nPanelist,.5,1.5),
mu.lsdP=.1,
sigma.lsdP=.1
)

#parameters <- c('sdPanelist','sdProduct','gsd','meanPanelist','meanProduct','Productdiff','sPanelist','sdRound','mRound')
parameters <- c('sdPanelMean','sdProduct','meanProduct','Productdiff','mPanelist','sPanelist','sdRound','mRound','sdSessionPanelist','sdPanelist','mu.lsdP','sigma.lsdP')

jagsfit <- jags(data=data_list,inits=inits,model.file=model.file,parameters.to.save=parameters,n.chains=4,DIC=FALSE,n.iter=20000)

# jagsfit # not shown
# plot(jagsfit) # not shown

jagsfit.mc <- as.mcmc(jagsfit)
# plot(jagsfit.mc) # not shown

fitsummary <- summary(jagsfit.mc)

# extract the scale effects and plot them
sPanelist <- fitsummary$quantiles[ grep('sPanelist',rownames(fitsummary$quantiles)),]
colnames(sPanelist) <-c('x2.5','x25','x50','x75','x97.5')
sPanelist <- as.data.frame(sPanelist)
sPanelist$pnum <- 1:nrow(sPanelist)
library(ggplot2)
limits <- aes(ymax = sPanelist$x97.5, ymin=sPanelist$x2.5) 
p <- ggplot(sPanelist, aes(y=x50, x=pnum)) 
p + geom_point() + geom_errorbar(limits, width=0.2) + scale_y_log10('Panelist scale') + scale_x_continuous("Panelist number") 

sdPanelist <- fitsummary$quantiles[ grep('sdPanelist',rownames(fitsummary$quantiles)),]
colnames(sdPanelist) <-c('x2.5','x25','x50','x75','x97.5')
sdPanelist <- as.data.frame(sdPanelist)
sdPanelist$pnum <- 1:nrow(sdPanelist)
limits <- aes(ymax = sdPanelist$x97.5, ymin=sdPanelist$x2.5) 
p <- ggplot(sdPanelist, aes(y=x50, x=pnum)) 
p + geom_point() + geom_errorbar(limits, width=0.2) + scale_y_log10('Panelist standard error') + scale_x_continuous("Panelist number") 

mPanelist <- fitsummary$quantiles[ grep('mPanelist',rownames(fitsummary$quantiles)),]
colnames(mPanelist) <-c('x2.5','x25','x50','x75','x97.5')
mPanelist <- as.data.frame(mPanelist)
mPanelist$pnum <- 1:nrow(mPanelist)
limits <- aes(ymax = mPanelist$x97.5, ymin=mPanelist$x2.5) 
p <- ggplot(mPanelist, aes(y=x50, x=pnum)) 
png('meanpanelist.png')
p + geom_point() + geom_errorbar(limits, width=0.2) + scale_y_continuous('Panelist mean') + scale_x_continuous("Panelist number") 
dev.off()

# extract product differences
Productdiff <- fitsummary$quantiles[ grep('Productdiff',rownames(fitsummary$quantiles)),]
# extract differences different from 0
data_list$Productcontr[Productdiff[,1]>0 | Productdiff[,5]<0,]
# get the product means
ProductMean <- fitsummary$statistics[ grep('meanProduct',rownames(fitsummary$quantiles)),]
rownames(ProductMean) <- levels(sensochoc$Product)
ProductMean


No comments:

Post a Comment