Sunday, September 28, 2014

Bayesian models in R

There are many ways to run general Bayesian calculations in or from R. The best known are JAGS, OpenBUGS and STAN. Then some time ago Rasmus Bååth had a post Three ways to run Bayesian models in R in which he mentioned LaplacesDemon (not on CRAN) on top of those. A check of the Bayes task view gives 'MCMCpack (...) contains a generic Metropolis sampler that can be used to fit arbitrary models', 'The mcmc package consists of an R function for a random-walk Metropolis algorithm for a continuous random vector' and 'Note that rcppbugs is a package that attempts to provide a pure R alternative to using OpenBUGS/WinBUGS/JAGS for MCMC'. 
Clearly there is a wealth of approaches, each with its own strengths and weaknesses. To compare them I decided to run a number of calculations through all of them. The source of these calculations is the fine SAS manual, where PROC MCMC has 21 examples. The first example, which I will try this week, is drawing samples from a number of distributions; from these I selected the mixture of three normal distributions.

Task

Draw samples from a mixture of three normal distributions. Let me start by stating that I would not use anything as complex as an MCMC sampler for this task; my personal SAS solution would be:
data p1;
    do i=1 to 1000;
        s=rand('table', .3, .4);
        select (s);
            when (1) r=rand('normal', -3, 2);
            when (2) r=rand('normal', 2, 1);
            when (3) r=rand('normal', 10, 4);
        end;
        output;
    end;
    keep r;
run;

proc sgplot data=p1;
    density r / type=kernel (c=.5);
run;
The equivalent R code:
library(plyr)
library(dplyr)
nu <- c(-3,2,10)
sd <- c(2,1,4)
tau <- sd^-2
wgt <- c(.3,.4,.3)

sampls <- sample(1:3,1000,replace=TRUE,prob=wgt)
sampls <- rnorm(1000,nu[sampls],sd[sampls])
density(sampls) %>% plot
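As a quick sanity check on this sampling approach (a sketch; `dmix` is just a helper name I introduce here): the weighted sum of normal densities should integrate to one, and the sample mean should approach the theoretical mixture mean.

```r
# analytic density of the three-component mixture (weights and parameters as above)
nu  <- c(-3, 2, 10); sd <- c(2, 1, 4); wgt <- c(.3, .4, .3)
dmix <- function(x) sapply(x, function(xi) sum(wgt * dnorm(xi, nu, sd)))

# the mixture density integrates to one ...
int <- integrate(dmix, -Inf, Inf)$value

# ... and the sample mean approaches the theoretical mixture mean
set.seed(1)
s <- sample(1:3, 1e5, replace = TRUE, prob = wgt)
r <- rnorm(1e5, nu[s], sd[s])
c(theoretical = sum(wgt * nu), sample = mean(r))   # theoretical mean is 2.9
```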

MCMCpack

Just having a density function was enough for MCMCpack. The samples have quite a long autocorrelation but do follow the distribution (not shown).
library(MCMCpack)
mydens <- function(x,...) {
    # MCMCmetrop1R expects the log of the (unnormalized) density
    step1 <- sum(wgt*dnorm(x,nu,sd))
    log(step1)
}
mcmcpack <- MCMCmetrop1R(mydens,theta.init=1)
acf(mcmcpack)
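For intuition, the same log-density can be sampled with a hand-rolled random-walk Metropolis in base R. This is a sketch of what MCMCmetrop1R does conceptually, not its actual implementation; the proposal standard deviation of 4 is an arbitrary tuning choice of mine.

```r
nu  <- c(-3, 2, 10); sd <- c(2, 1, 4); wgt <- c(.3, .4, .3)
logdens <- function(x) log(sum(wgt * dnorm(x, nu, sd)))

set.seed(42)
n <- 20000
chain <- numeric(n)
x <- 1                        # starting value, like theta.init above
lx <- logdens(x)
for (i in 1:n) {
  prop <- x + rnorm(1, 0, 4)  # proposal sd 4 is a tuning choice
  lp <- logdens(prop)
  # accept with probability min(1, density ratio), on the log scale
  if (log(runif(1)) < lp - lx) { x <- prop; lx <- lp }
  chain[i] <- x
}
mean(chain)   # the mixture mean is 2.9; the chain mean should be in that neighbourhood
```

The long autocorrelation seen in the MCMCpack samples shows up here too: a random-walk sampler moves between the three modes only occasionally.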

mcmc

mcmc has the property that its result is not the samples themselves, but summary statistics (batch means) of the samples. Hence it does not directly give samples from the desired distribution.

JAGS

The JAGS samples had no autocorrelation and followed the distribution.
library(R2jags)
jmodel <- function() {
    i ~ dcat(wgt)
    nunu <- (i==1)*nu[1]+
            (i==2)*nu[2]+
            (i==3)*nu[3]
    tautau <- (i==1)*tau[1]+
              (i==2)*tau[2]+
              (i==3)*tau[3]
    p ~ dnorm(nunu,tautau)
}

datain <- list(nu=nu,tau=tau,wgt=wgt)
parameters <- c('i','p')

jj <- jags(model.file=jmodel,data=datain,parameters=parameters,DIC=FALSE)
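The indicator sums in jmodel just emulate vector indexing; in plain R the same construction reduces to nu[i] and tau[i]:

```r
# the (i==k)*nu[k] sums select exactly one component, i.e. they equal nu[i]
nu  <- c(-3, 2, 10)
tau <- c(2, 1, 4)^-2
for (i in 1:3) {
  nunu   <- (i==1)*nu[1]  + (i==2)*nu[2]  + (i==3)*nu[3]
  tautau <- (i==1)*tau[1] + (i==2)*tau[2] + (i==3)*tau[3]
  stopifnot(nunu == nu[i], tautau == tau[i])
}
```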

STAN

I did not manage to get STAN to give me the samples.

rcppbugs

I did not manage to get rcppbugs to give me the samples.

LaplacesDemon

LaplacesDemon feels like the most complex of the direct R solutions.
library('LaplacesDemon')
mon.names <- "LP"
parm.names <- 'x'
MyData <- list(mon.names=mon.names, parm.names=parm.names)
N <- 1
Model <- function(parm, Data)
{
    x <- parm[1]
    LL <- sum(wgt*dnorm(x,nu,sd)) %>% log
    Modelout <- list(LP=LL, Dev=-2*LL, Monitor=LL,
        yhat=x, parm=parm)
    return(Modelout)
}
Initial.Values <- 1
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values)
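The Model function can be exercised on its own before handing it to LaplacesDemon; here a base-R version (without the pipe) evaluated at a single, arbitrarily chosen point:

```r
nu <- c(-3, 2, 10); sd <- c(2, 1, 4); wgt <- c(.3, .4, .3)
Model <- function(parm, Data) {
  x <- parm[1]
  LL <- log(sum(wgt * dnorm(x, nu, sd)))   # log mixture density
  list(LP = LL, Dev = -2 * LL, Monitor = LL, yhat = x, parm = parm)
}
out <- Model(2, NULL)
out$LP    # log density at x = 2
out$Dev   # deviance, by construction -2 * LP here
```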

Sunday, September 21, 2014

Trying dplyr on triathlon data

There was a triathlon in Almere last week, as there has been every year since 1983. I pulled the data of all years to get an idea of how the sport has changed. To get a visual I decided to plot the best 10% of athletes. Later I decided this was an ideal moment to look at plyr and dplyr again, so I rewrote everything using those tools. I must say I like them very much and intend to use them again.

Data

Data come from triathlon-uitslagen.nl, this year's website www.challenge-almere.com and Wikipedia, which told me there was one lady in 1983. The data sit in a bunch of spreadsheets (.xls and .xlsx) with differing columns, levels of detail and even those merged cells which make things look nice. Some years listed, besides finishers, those who did not finish, broken down by the time of certain milestones. Some years had teams next to individuals. It was a bit too much variation to handle in R, so I did a first cleaning, removing spurious columns and rows, in LibreOffice and made relatively uniform .xls files.
During import R read most times as date-times, some as character. The date-times carried a dummy date (1899-12-31), so I converted the times to difftimes and retained the original as character. The actual year of the race is pulled from the spreadsheet name.
library(XLConnect)
library(plyr)
library(dplyr)
library(lattice)
setwd('C:\\Users\\Kees\\Documents\\r hobby\\almeretriatlon')
dd <- dir(pattern='.xls')

readyear <- function(filename) {
  print(filename)
  wb <- loadWorkbook(filename)
  rw <- readWorksheet(wb,sheet=1)
  names(rw) <- tolower(names(rw))
  names(rw)[names(rw)=='total'] <- 'time'
  mutate(rw,
      timec=if (class(rw$time)[1]=='character') {
            time
          } else format(time,'%H:%M:%S') ,
      time=as.difftime(timec),
      year=as.numeric(gsub('[A-Za-z\\. ]','',filename))
  )
}
years0 <- lapply(dd,readyear)
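The time handling above hinges on two conversions; a standalone illustration with a made-up finishing time on the dummy date (the value 08:45:30 is invented, not from the data):

```r
# a finishing time as the spreadsheet reader delivers it: on the dummy date
t <- as.POSIXct("1899-12-31 08:45:30", tz = "UTC")
timec <- format(t, "%H:%M:%S")                     # character form "08:45:30"
time  <- as.difftime(timec, format = "%H:%M:%S")   # drop the date, keep the duration
as.numeric(time, units = "hours")                  # 8.758...
```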
The resulting data may have a gender variable and/or a cat variable. The cat variable may be character and contain gender plus age bracket as digits (or team, or PRO or ELI), but it can also contain only the age bracket and hence be numerical. Gender can be D, H, M, V or W, interpreted as dame, heer, man, vrouw, woman respectively (female, male, male, female, female).
For plotting purposes a two digit Year factor was added, running from 83 to 14.
triatlon <- ldply(years0,function(x) {
# first unify cat info & gender 
          x <- if ('cat' %in% names(x)) {
                if (class(x$cat)[1]=='character') {
                  mutate(x,
                      gender=substr(cat,1,1),
                      cat=substring(cat,2))
                } else  mutate(x,
                      cat=as.character(cat))
              } else mutate(x,cat=NA)
        }) %>% 
    filter(.,!(gender %in% c('T','P'))) %>%
    mutate(.,
        timen=as.numeric(time) , 
        gender=factor(
            ifelse(gender %in% c('H','M'),
                'Male','Female')) ,
        Year=factor(substr(format(year),3,4),
            levels=substr(format(
                    seq(min(year),max(year))),3,4))) %>%
    select(.,
        gender,
        cat,
        year,
        Year,
        time,
        timec,
        timen)  
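The two-digit Year factor deserves a remark: by building the levels from the full year sequence, years without data keep an (empty) slot, which is why drop.unused.levels=FALSE matters in the plot. A standalone sketch with invented years:

```r
year <- c(1983, 1990, 2014)                     # suppose only these years had data
Year <- factor(substr(format(year), 3, 4),
               levels = substr(format(seq(min(year), max(year))), 3, 4))
levels(Year)        # "83" "84" ... "99" "00" ... "14", 32 levels in all
table(Year)["90"]   # the single 1990 entry
```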

Plot 

This is actually where plyr/dplyr worked best to get nice clean code; it was as if it was made for this purpose: define groups, define selection and plot. The choice for lattice plots was made before I recoded and I did not feel like changing it.
The plot shows times steadily decreasing through the previous century and stabilizing around the turn of the century. 2006 was a European championship and 2008 a world championship, yet these years do not stick out as faster. Hence these times should be fairly representative of what can be achieved in the Netherlands. There is quite some year-to-year variation, but that can be attributed to weather and the tracks used.
group_by(triatlon,year,gender) %>% 
    filter(.,percent_rank(time)<0.1) %>%
    bwplot(timen ~ Year | gender,
        data=.,
        ylab='Time (h)',
        xlab='Year',
        horizontal=FALSE,
        layout=c(1,2,1),
        drop.unused.levels=FALSE,
        scales=list(cex=.7))
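percent_rank(time) < 0.1 keeps roughly the fastest 10% within each group; a base-R equivalent of dplyr's definition, (rank - 1)/(n - 1), on made-up times:

```r
# base-R version of dplyr::percent_rank
percent_rank0 <- function(x) (rank(x, ties.method = "min") - 1) / (sum(!is.na(x)) - 1)

times <- c(9.0, 9.5, 8.7, 10.2, 11.0, 8.9, 9.8, 10.5, 9.2, 12.0)  # hours, invented
times[percent_rank0(times) < 0.1]    # with n = 10, only the fastest time survives: 8.7
```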

Sunday, September 14, 2014

Trying a prefmap

Preference mapping is a key technique in sensory and consumer research. It links the sensory perception of products to the liking of products and hence provides clues for the development of new, good-tasting products. Even though it is a key technique, how to perform such an analysis well is a long-standing problem. In R the SensoMineR package provides a prefmap procedure. This post attempts to create such an analysis with Stan.

Data

Data are the cocktail data from the SensoMineR package. After conversion to a 0 to 10 scale, with 10 most liked, the product means are:
   means
1   5.03
2   3.02
3   5.42
4   3.55
5   5.67
6   5.74
7   3.84
8   3.75
9   4.17
10  4.26
11  3.20
12  3.88
13  5.98
14  3.95
15  6.47
16  4.90


Model

The model is built upon my post of last week: Mapping products in a space. What is added is a consumer section. Each consumer's preference is modeled as an ideal point, where liking is maximal, with points further away liked less and less. In mathematical terms the distance-dependent function is maxliking * exp(-distance * scale). The ideal point is captured by three numbers: its liking and its two coordinates. The scale parameter is, for now, common to all consumers. In addition there is a lot of code for administration of all parameters.
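Before the Stan code, the consumer part in plain R (a sketch of the liking function only, with illustrative numbers):

```r
# liking decays exponentially with distance from the consumer's ideal point
liking <- function(prod, ideal, maxliking, scale)
  maxliking * exp(-sqrt(sum((prod - ideal)^2)) * scale)

liking(c(0, 0), c(0, 0), maxliking = 5, scale = 1)   # at the ideal point: 5
liking(c(1, 0), c(0, 0), maxliking = 5, scale = 1)   # one unit away: 5 * exp(-1)
```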
model1 <- "
data {
        int<lower=0> ndescriptor;
        int<lower=0> nproduct;
        matrix[nproduct,ndescriptor] profile;
    int<lower=0> nconsumer;
    matrix[nproduct,nconsumer] liking;
}
parameters {
    row_vector[nproduct] prodsp1;
    row_vector[nproduct] prodsp2;
    real<lower=0,upper=1> sigma1;
    real<lower=0,upper=1> sigma2;
    matrix [nconsumer,2] optim;
    vector <lower=0> [nconsumer] maxima;
    real <lower=0> scale;
    real <lower=0> sliking;
}
transformed parameters {
   vector [ndescriptor] descrsp1;
   vector [ndescriptor] descrsp2;
     matrix[nproduct,ndescriptor] expected1;  
     matrix[nproduct,ndescriptor] expected2;  
     matrix[nproduct,ndescriptor] residual1;  
     vector[nproduct] distances;
     matrix[nproduct,nconsumer] likepred;


   descrsp1 <- profile'*prodsp1';
   expected1 <- (descrsp1*prodsp1)';
   residual1 <- profile-expected1;
   descrsp2 <- profile'*prodsp2';
   expected2 <- (descrsp2*prodsp2)';
   for (i in 1:nconsumer) {
      for (r in 1:nproduct) {
      distances[r] <- sqrt(square(prodsp1[r]-optim[i,1])+
                           square(prodsp2[r]-optim[i,2]));
      likepred[r,i] <- maxima[i]*exp(-distances[r]*scale);
      }
   }
}
model {  
     for (r in 1:nproduct) {
        prodsp1[r] ~ normal(0,1);
        prodsp2[r] ~ normal(0,1);
        for (c in 1:ndescriptor) {
           profile[r,c] ~ normal(expected1[r,c],sigma1);
           residual1[r,c] ~ normal(expected2[r,c],sigma2);
        }
        for (i in 1:nconsumer) {
           liking[r,i] ~ normal(likepred[r,i],sliking);
           optim[i,1] ~ normal(0,2);
           optim[i,2] ~ normal(0,2);
        }
    scale ~ normal(1,.1);
    maxima ~ normal(5,2);
    sliking ~ normal(2,1);
    }
}
generated quantities {
   vector [ndescriptor] descrspace1;
   vector [ndescriptor] descrspace2;
   row_vector [nproduct] prodspace1;
   row_vector [nproduct] prodspace2;
   matrix [nconsumer,2] optima;

   prodspace1 <-(
                     ((prodsp1[12]>0)*prodsp1)-
                     ((prodsp1[12]<0)*prodsp1)
                  );
   prodspace2 <-(
                     ((prodsp2[12]>0)*prodsp2)-
                     ((prodsp2[12]<0)*prodsp2)
                  ); 
   descrspace1 <-(
                     ((prodsp1[12]>0)*descrsp1)-
                     ((prodsp1[12]<0)*descrsp1)
                  );
   descrspace2 <-(
                     ((prodsp2[12]>0)*descrsp2)-
                     ((prodsp2[12]<0)*descrsp2)
                  ); 
   for (i in 1:nconsumer) {
      optima[i,1] <- (
                        ((prodsp1[12]>0)*optim[i,1])-
                        ((prodsp1[12]<0)*optim[i,1])
                     );
      optima[i,2] <- (
                        ((prodsp2[12]>0)*optim[i,2])-
                        ((prodsp2[12]<0)*optim[i,2])
                     );
   }
}
"

Analysis results

Sensominer's result

For comparison, here is the plot resulting from SensoMineR's carto() function. I have followed the parameter settings from the SensoMineR package to get this plot. Color is liking; numbered dots are products. The blue zone is best liked, as can be seen from the products with the highest means residing there.

New method

In the plot the blue dots are posterior samples of ideal points, the bigger black numbers are product locations and the smaller red numbers are the consumers' (mean) ideal points.
This is different from the SensoMineR map: the consumers have pulled well-liked products such as 13 and 15 to the center. I suspect that in this analysis the consumers' preferences have overruled most information from the sensory space. Given that, I will split the consumers into groups.

Three groups of consumers

Three groups of consumers were created via k-means clustering. From a sensory and consumer-insight point of view the clusters may describe three different ways to experience these particular products. Obviously a clustering on demographics or marketing segments might be equally valid, but I don't have that information. The cluster sizes are 15, 52 and 33 respectively.

Cluster 1

This cluster is characterized by liking for products 8 to 11. Compared to the original space, this cluster does not like products 13 and 15 so much, and does not dislike products 4 and 12 so much.

Cluster 2

These consumers form the bulk, and their result is close to that of all consumers, but more pronounced. However, product 1 has shifted quite a distance towards liked.

Cluster 3

This plot is again fairly similar to the all-consumer plot. What is noticeable here is that there is a void in the center: the center of the most liked region is not occupied.

Next Steps

There are still some things to improve in this approach: better tuning of the various priors in the model; modeling the range of each consumer's liking rather than solely the maximum; possibly making the scale parameter subject-dependent; perhaps a better way to extract the dimensions from sensory space, thereby avoiding the Jacobian warning and using estimated standard deviations of the sensory profiling data; and finally, improved graphics.

Code

# Reading and first map

# senso.cocktail
# hedo.cocktail
library(SensoMineR)
data(cocktail)
res.pca <- PCA(senso.cocktail,graph=FALSE)
# SensoMineR does a dev.new for each graph, hence captured like this.
dev.new <- function() png('carto.png')
res.carto <- carto(res.pca$ind$coord[,1:2],
    graph.tree=FALSE,
    graph.corr=FALSE,
    hedo.cocktail)
dev.off()
# reset default graph settings
rm(dev.new)
dev.new()

# model

library(rstan)
nprod <- 16
ndescr <- 13
nconsumer <- 100
sprofile <- as.matrix(scale(senso.cocktail))
datain <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile,
    nconsumer=nconsumer,
    liking = as.matrix(10-hedo.cocktail[,1:nconsumer])
)
data.frame(means=rowMeans(10-hedo.cocktail)  )

model1 <- "
data {
        int<lower=0> ndescriptor;
        int<lower=0> nproduct;
        matrix[nproduct,ndescriptor] profile;
    int<lower=0> nconsumer;
    matrix[nproduct,nconsumer] liking;
}
parameters {
    row_vector[nproduct] prodsp1;
    row_vector[nproduct] prodsp2;
    real<lower=0,upper=1> sigma1;
    real<lower=0,upper=1> sigma2;
    matrix [nconsumer,2] optim;
    vector <lower=0> [nconsumer] maxima;
    real <lower=0> scale;
    real <lower=0> sliking;
}
transformed parameters {
   vector [ndescriptor] descrsp1;
   vector [ndescriptor] descrsp2;
     matrix[nproduct,ndescriptor] expected1;  
     matrix[nproduct,ndescriptor] expected2;  
     matrix[nproduct,ndescriptor] residual1;  
     vector[nproduct] distances;
     matrix[nproduct,nconsumer] likepred;


   descrsp1 <- profile'*prodsp1';
   expected1 <- (descrsp1*prodsp1)';
   residual1 <- profile-expected1;
   descrsp2 <- profile'*prodsp2';
   expected2 <- (descrsp2*prodsp2)';
   for (i in 1:nconsumer) {
      for (r in 1:nproduct) {
      distances[r] <- sqrt(square(prodsp1[r]-optim[i,1])+
                           square(prodsp2[r]-optim[i,2]));
      likepred[r,i] <- maxima[i]*exp(-distances[r]*scale);
      }
   }
}
model {  
     for (r in 1:nproduct) {
        prodsp1[r] ~ normal(0,1);
        prodsp2[r] ~ normal(0,1);
        for (c in 1:ndescriptor) {
           profile[r,c] ~ normal(expected1[r,c],sigma1);
           residual1[r,c] ~ normal(expected2[r,c],sigma2);
        }
        for (i in 1:nconsumer) {
           liking[r,i] ~ normal(likepred[r,i],sliking);
           optim[i,1] ~ normal(0,2);
           optim[i,2] ~ normal(0,2);
        }
    scale ~ normal(1,.1);
    maxima ~ normal(5,2);
    sliking ~ normal(2,1);
    }
}
generated quantities {
   vector [ndescriptor] descrspace1;
   vector [ndescriptor] descrspace2;
   row_vector [nproduct] prodspace1;
   row_vector [nproduct] prodspace2;
   matrix [nconsumer,2] optima;

   prodspace1 <-(
                     ((prodsp1[12]>0)*prodsp1)-
                     ((prodsp1[12]<0)*prodsp1)
                  );
   prodspace2 <-(
                     ((prodsp2[12]>0)*prodsp2)-
                     ((prodsp2[12]<0)*prodsp2)
                  ); 
   descrspace1 <-(
                     ((prodsp1[12]>0)*descrsp1)-
                     ((prodsp1[12]<0)*descrsp1)
                  );
   descrspace2 <-(
                     ((prodsp2[12]>0)*descrsp2)-
                     ((prodsp2[12]<0)*descrsp2)
                  ); 
   for (i in 1:nconsumer) {
      optima[i,1] <- (
                        ((prodsp1[12]>0)*optim[i,1])-
                        ((prodsp1[12]<0)*optim[i,1])
                     );
      optima[i,2] <- (
                        ((prodsp2[12]>0)*optim[i,2])-
                        ((prodsp2[12]<0)*optim[i,2])
                     );
   }
}
"

pars <- c('prodspace1','prodspace2','optima','scale','maxima')

fit <- stan(model_code = model1,
    data = datain,
    pars=pars)

# plotting

fitsamps <- as.data.frame(fit)

combiplot <- function(fitsamps,datain,labs) {
    prod <- reshape(fitsamps,
        drop=names(fitsamps)[33:ncol(fitsamps)],
        direction='long',
        varying=list(names(fitsamps)[1:16],
            names(fitsamps)[17:32]),
        timevar='sample',
        times=1:16,
        v.names=c('PDim1','PDim2')
    )
        sa <- sapply(1:16,function(x)
            c(sample=x,
                Dim1=mean(prod$PDim1[prod$sample==x]),
                Dim2=mean(prod$PDim2[prod$sample==x])))
    sa <- as.data.frame(t(sa))
   
    optimindex <- grep('optima',names(fitsamps))
    noptim <- datain$nconsumer
    loc <- reshape(fitsamps,
        drop=names(fitsamps)[(1:ncol(fitsamps))[-optimindex]],
        direction='long',
        varying=list(names(fitsamps)[optimindex[1:noptim]],
            names(fitsamps)[optimindex[(1:noptim)+noptim]]),
        timevar='subject',
        times=1:noptim,
        v.names=c('Dim1','Dim2')
    )
    locx <- loc[sample(nrow(loc),60000),]
    plot(locx$Dim1,locx$Dim2,
        col='blue',
        pch=46,
        cex=2,
        xlim=c(-1,1)*.7,
        ylim=c(-1,1)*.7)
    sa2 <- sapply(1:noptim,function(x)
            c(sample=x,
                Dim1=mean(loc$Dim1[loc$subject==x]),
                Dim2=mean(loc$Dim2[loc$subject==x])))
    sa2 <- as.data.frame(t(sa2))
    text(x=sa2$Dim1,y=sa2$Dim2,labels=labs,cex=.8,col='red')
    text(x=sa$Dim1,y=sa$Dim2,labels=sa$sample,cex=1.5)
    invisible(fitsamps)
}

combiplot(fitsamps,datain,1:100)

# three clusters

tlik <- t(scale(hedo.cocktail))
km <- kmeans(tlik,centers=3)
table(km$cluster)


datain1 <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile,
    nconsumer=sum(km$cluster==1),
    liking = as.matrix(10-hedo.cocktail[,km$cluster==1])
)
fit1 <- stan(model_code = model1,
    data = datain1,
    fit=fit,
    pars=pars)

fitsamps1 <- as.data.frame(fit1)
#

datain2 <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile,
    nconsumer=sum(km$cluster==2),
    liking = as.matrix(10-hedo.cocktail[,km$cluster==2])
)
fit2 <- stan(model_code = model1,
    data = datain2,
    fit=fit,
    pars=pars)

fitsamps2 <- as.data.frame(fit2)
##
datain3 <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile,
    nconsumer=sum(km$cluster==3),
    liking = as.matrix(10-hedo.cocktail[,km$cluster==3])
)
fit3 <- stan(model_code = model1,
    data = datain3,
    fit=fit,
    pars=pars)

fitsamps3 <- as.data.frame(fit3)
combiplot(fitsamps1,datain1,which(km$cluster==1))
combiplot(fitsamps2,datain2,which(km$cluster==2))
combiplot(fitsamps3,datain3,which(km$cluster==3))

Sunday, September 7, 2014

Mapping products in a space

I have read about people doing Bayesian PCA and always wondered how that would work. At some point I thought of a way to do so. As ideas evolved my interest shifted from PCA as such to a prefmap. As a first step towards that, this post contains the mapping from a sensory space to a two-dimensional space. For a prefmap this step is commonly done via a PCA.

Data

Data are the cocktail data from the SensoMineR package.

Algorithm

The calculation is mostly inspired by the many PLS algorithms to which I was exposed when I was doing chemometrics. Scores and loadings may be obtained from each other by multiplying with the data matrix. In this case it means I just take a set of product scores and obtain the associated descriptor loadings via a simple matrix multiplication. The resulting product and descriptor vectors can be used to reconstruct the original matrix; the best solution minimizes the difference between the reconstructed and original data. For dimension two, subtract the reconstructed data from the original data and repeat on the residuals.
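A sketch of that multiplication step on random data (dimensions as in the cocktail set; this illustrates the algebra only, not the estimation, which Stan does):

```r
set.seed(7)
X <- matrix(rnorm(16 * 13), 16, 13)   # products x descriptors
scores <- rnorm(16)                   # a candidate set of product scores, dimension 1
loadings <- t(X) %*% scores           # descriptor loadings follow by multiplication
recon <- scores %*% t(loadings)       # rank-one reconstruction of X
resid <- X - recon                    # dimension 2 is then fitted on these residuals
dim(resid)                            # 16 13
```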

Scaling

PCA offers the choice of unit-length scores or loadings, or R and Q mode if that is your favorite jargon. From a singular value decomposition viewpoint it is just a matter of where the singular values go. At this point I made the choice to put them in the variable space.

Unique solution

PCA is known not to have one unique solution; each solution is equivalent to its mirror image. It seemed most elegant to handle this completely at the end. After inspection of the data, the location of product 12 seemed suitable for making the solution unique, since it is extreme on both dimensions. The final step (generated quantities) forces its location into the top-right quadrant for the reported data.
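The generated-quantities trick in plain R, with an invented draw where product 12 happened to land negative:

```r
set.seed(11)
prodsp1 <- c(rnorm(11), -0.8, rnorm(4))     # product 12 is negative in this draw
# flip the sign of the whole axis so product 12 ends up positive
prodspace1 <- ((prodsp1[12] > 0) * prodsp1) - ((prodsp1[12] < 0) * prodsp1)
prodspace1[12] > 0                          # TRUE: the axis was mirrored as a whole
```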

Code

library(rstan)
nprod <- 16
ndescr <- 13
sprofile <- as.matrix(scale(senso.cocktail,scale=FALSE))
datain <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile
    )
   
model1 <- "
data {
        int<lower=0> ndescriptor;
        int<lower=0> nproduct;
        matrix[nproduct,ndescriptor] profile;
}
parameters {
    row_vector[nproduct] prodsp1;
    row_vector[nproduct] prodsp2;
    real<lower=0,upper=1> sigma1;
    real<lower=0,upper=1> sigma2;
}
transformed parameters {
   vector [ndescriptor] descrsp1;
   vector [ndescriptor] descrsp2;
   matrix[nproduct,ndescriptor] expected1;  
   matrix[nproduct,ndescriptor] expected2;  
   matrix[nproduct,ndescriptor] residual1;  

   descrsp1 <- profile'*prodsp1';
   expected1 <- (descrsp1*prodsp1)';
   residual1 <- profile-expected1;
   descrsp2 <- profile'*prodsp2';
   expected2 <- (descrsp2*prodsp2)';
}
model {  
     for (r in 1:nproduct) {
        prodsp1[r] ~ normal(0,1);
        prodsp2[r] ~ normal(0,1);
        for (c in 1:ndescriptor) {
           profile[r,c] ~ normal(expected1[r,c],sigma1);
           residual1[r,c] ~ normal(expected2[r,c],sigma2);
        }
     }
}
generated quantities {
   vector [ndescriptor] descrspace1;
   vector [ndescriptor] descrspace2;
   row_vector [nproduct] prodspace1;
   row_vector [nproduct] prodspace2;
   prodspace1 <-(
                     ((prodsp1[12]>0)*prodsp1)-
                     ((prodsp1[12]<0)*prodsp1)
                  );
   prodspace2 <-(
                     ((prodsp2[12]>0)*prodsp2)-
                     ((prodsp2[12]<0)*prodsp2)
                  ); 
   descrspace1 <-(
                     ((prodsp1[12]>0)*descrsp1)-
                     ((prodsp1[12]<0)*descrsp1)
                  );
   descrspace2 <-(
                     ((prodsp2[12]>0)*descrsp2)-
                     ((prodsp2[12]<0)*descrsp2)
                  ); 
}
"
pars <- c('prodspace1','prodspace2','descrspace1','descrspace2')
fit1 <- stan(model_code = model1,
    data = datain,
    pars=pars)

Results

For comparison, first a standard biplot.

Product space

It is not difficult to extract the samples and plot them; see the end of this post. One notable property of the plot is that the products fall in ellipses with the minor axis towards the center. Apparently part of the variation between MCMC samples is rotational freedom between the dimensions. Other than that, the solution is actually pretty close to the PCA solution.

Descriptor space

The rotational freedom is even clearer here.

Additional code

data

library(SensoMineR)
data(cocktail)

biplot

pr <- prcomp(senso.cocktail) 
plot(pr)
biplot(pr)

product plot

fit1samps <- as.data.frame(fit1)

prod <- reshape(fit1samps,
    drop=names(fit1samps)[33:59],
    direction='long',
    varying=list(names(fit1samps)[1:16],
        names(fit1samps)[17:32]),
    timevar='sample',
    times=1:16,
    v.names=c('PDim1','PDim2')
)
   
prod <- prod[order(prod$PDim1),]
plot(prod$PDim1,prod$PDim2,
    col=c(2,17,3,4,6,5,7:10,13,12,11,14:16)[prod$sample],
    pch=46,
    cex=2,
    xlim=c(-1,1)*.75,
    ylim=c(-1,1)*.75)
sa <- sapply(1:16,function(x)
        c(sample=x,
            Dim1=mean(prod$PDim1[prod$sample==x]),
            Dim2=mean(prod$PDim2[prod$sample==x])))
sa <- as.data.frame(t(sa))
text(x=sa$Dim1,y=sa$Dim2,labels=sa$sample,cex=1.5)

descriptor plot

descr <- reshape(fit1samps,
    drop=names(fit1samps)[c(1:32,59)],
    direction='long',
    varying=list(names(fit1samps)[33:45],
        names(fit1samps)[46:58]),
    timevar='sample',
    times=1:13,
    v.names=c('DDim1','DDim2')
)

descr <- descr[order(descr$DDim1),]
plot(descr$DDim1,descr$DDim2,
    col=c(2,1,3:13)[descr$sample],
    pch=46,
    cex=2,
    xlim=c(-1,1)*9,
    ylim=c(-1,1)*9)
sa <- sapply(1:13,function(x)
        c(sample=x,
            Dim1=mean(descr$DDim1[descr$sample==x]),
            Dim2=mean(descr$DDim2[descr$sample==x])))
sa <- as.data.frame(t(sa))
text(x=sa$Dim1,y=sa$Dim2,labels=names(senso.cocktail))

Sunday, August 31, 2014

Beta binomial revisited

I got some comments on the priors in last week's post, where better priors were proposed. Hence this week's post looks at them. After some minor adaptations, since I have nine beta parameters and beta needs to be fairly large, I can conclude that these priors provide significant improvements. Especially the model with p(alpha,beta) proportional to (alpha + beta)^(-5/2) worked extremely well.

Priors

There were two proposals. A common step was 'parameterizing in terms of mean theta = (alpha / (alpha + beta)) and total count kappa = (alpha + beta)'. In my own data there are nine regions, hence nine sets of theta, kappa, alpha and beta. As hyperprior over the nine regions' incidence I chose a normal distribution for theta. The difference between the proposals is in kappa. I placed no hyperpriors over the nine values of kappa.
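The reparameterization in numbers (illustrative values, not from my data):

```r
theta <- 0.12; kappa <- 50          # mean and total count
alpha <- kappa * theta              # 6
beta  <- kappa * (1 - theta)        # 44
c(mean = alpha / (alpha + beta),    # recovers theta
  count = alpha + beta)             # recovers kappa
```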

Prior: Uniform on log scale

'Andrew (...) suggests taking the priors on alpha and beta to be uniform on the log scale.' In code this means log(kappa) is uniformly distributed.

Prior: p(alpha,beta) prop to (alpha + beta)^(-5/2)

Apparently in Bayesian Data Analysis 3 it is proposed to make p(alpha,beta) prop to (alpha + beta)^(-5/2). This is handled via a Pareto distribution.

Sampling kappa priors

Via a small program I had a look at what this means in terms of samples. The uniform-on-log-scale prior seemed reasonable even for the range of my beta parameters. On the other hand, the p(alpha,beta) prop to (alpha + beta)^(-5/2) proposal kept kappa way too low for the data at hand. It seems informative for a different range of kappa, and hence beta, rather than uninformative. Hence I made a variation with a much higher lower bound.
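What pareto(0.1, 1.5) implies for kappa can also be seen without Stan, via inverse-CDF sampling (a sketch):

```r
# Pareto(x_min, shape) via the inverse CDF: x = x_min / U^(1/shape)
set.seed(1)
u <- runif(1e5)
kappa <- 0.1 / u^(1 / 1.5)
median(kappa)        # theoretical median: 0.1 * 2^(1/1.5), about 0.16
mean(kappa > 1e6)    # essentially never reaches the range my data needs
```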
modelk <- '
parameters {
    real kappa_log1;
    real kappa2;
    real <lower=10^6> kappa3;
}
transformed parameters {
    real kappa1;
    kappa1 <- exp(kappa_log1);
}
model {      
    kappa_log1 ~ uniform(0,18);
    kappa2 ~ pareto(0.1,1.5);
    kappa3 ~ pareto(10^6,1.5);
}
'
parametersk <- c('kappa1','kappa2','kappa3')
initsk <- function() {
    list(
        kappa_log1=runif(1,0,18)
        , kappa2 = runif(1,.2,10)
        , kappa3 = runif(1,2e6,1e7)
    )
}

kfits <- stan(model_code = modelk,
        pars=parametersk,
        inits=initsk)
kfits

Inference for Stan model: modelk.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean   se_mean         sd       2.5%        25%        50%
kappa1 1994621.64 550416.10 7754182.44       1.57      82.85    4772.51
kappa2       0.24      0.02       0.27       0.10       0.12       0.16
kappa3 2602295.43 296845.16 4314974.09 1021790.77 1209898.43 1558773.41
lp__       -18.81      0.16       1.65     -23.11     -19.64     -18.41
              75%       97.5% n_eff Rhat
kappa1  160223.83 25951765.02   198 1.03
kappa2       0.25        0.89   123 1.03
kappa3 2490623.45 11991112.38   211 1.02
lp__       -17.59      -16.83   112 1.03

Samples were drawn using NUTS(diag_e) at Sat Aug 30 14:12:51 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Effect in the whole model

Prior: Uniform on log scale 

This works reasonably well. With significantly fewer samples than last week there are reasonable estimates. The only worrying thing is that incidence[1] (it was named betamean[1] in the previous models, but that was actually a bit of a misnomer) is a bit too high.
 model1 <- '
    data {
       int<lower=0> n;
       int<lower=0> nregion;
       int count[n];
       int<lower=1,upper=nregion> Region[n];
       int Population[n];
       real scale;
        }
    parameters {
       vector <lower=0,upper=1> [nregion]  theta;
       vector [nregion]  kappa_log;
       vector <lower=0,upper=1> [n] p1;
       real <lower=0,upper=1> thetam;
       real <lower=0> thetasd;
        }
  transformed parameters {
       vector [nregion] kappa;
       vector [nregion] a;
       vector [nregion] b;

       kappa <- exp(kappa_log);
       for (i in 1:nregion) {
          a[i] <- kappa[i] * theta[i];
          b[i] <- kappa[i] * (1 - theta[i]);
       }
  }
  model {     
        kappa_log ~ uniform(12,18);
        for (i in 1:n) {
            p1[i] ~ beta(a[Region[i]],b[Region[i]]);
        }
        count ~ binomial(Population,p1);
        theta ~ normal(thetam,thetasd);
        thetam ~ uniform(0,1);
        thetasd ~ uniform(0,.1);      
        }
  generated quantities {
        vector [n] pp;
        vector [nregion] incidence;
        real meaninc;
        incidence <- scale*theta;
        pp <- p1 * scale;
        meaninc <- scale*thetam;
        }
'

inits1 <- function()
    list(theta=rnorm(datain$nregion,1e-6,1e-7),
         thetam = rnorm(1,1e-6,1e-7),
         thetasd = rnorm(1,1e-6,1e-7),
         kappa_log=runif(datain$nregion,15,18),
         p1=rnorm(datain$n,1e-7,1e-8))

parameters1 <- c('a','b','incidence','kappa','kappa_log','meaninc')

fits1 <- stan(model_code = model1,
    data = datain,
    pars=parameters1    ,
    init=inits1)
print(fits1,probs=NA)

Inference for Stan model: model1. 
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

                    mean    se_mean          sd    n_eff Rhat
a[1]                7.95       1.32       13.67 NA   107 1.04
a[2]                9.16       1.43       16.76 NA   138 1.02
a[3]               50.52       2.61       32.13 NA   151 1.03
a[4]               35.68       2.25       32.61 NA   210 1.01
a[5]                9.10       1.32       13.99 NA   113 1.04
a[6]               11.99       1.48       17.40 NA   139 1.01
a[7]               13.64       1.83       20.56 NA   126 1.01
a[8]               14.19       2.56       20.35 NA    63 1.05
a[9]               12.14       2.51       21.39 NA    73 1.05
b[1]          5956080.76 1213799.11 10903417.54 NA    81 1.04
b[2]          5293475.42  844235.36  9984631.29 NA   140 1.01
b[3]         28376424.43 1399820.31 17900739.23 NA   164 1.03
b[4]         19757543.83 1195288.60 17665183.92 NA   218 1.02
b[5]          5493903.33  897750.46  9136674.30 NA   104 1.05
b[6]          6449567.62  787646.39  9190268.32 NA   136 1.01
b[7]          8500594.37 1172609.55 12531293.24 NA   114 1.01
b[8]          8901106.72 1698740.11 12644831.14 NA    55 1.06
b[9]          7267575.15 1406258.66 12121859.82 NA    74 1.05
incidence[1]        0.10       0.00        0.02 NA    84 1.07
incidence[2]        0.11       0.00        0.02 NA   177 1.02
incidence[3]        0.11       0.00        0.01 NA   153 1.01
incidence[4]        0.11       0.00        0.01 NA   178 1.02
incidence[5]        0.11       0.00        0.02 NA   216 1.01
incidence[6]        0.12       0.00        0.02 NA   121 1.03
incidence[7]        0.10       0.00        0.02 NA   161 1.02
incidence[8]        0.10       0.00        0.02 NA    81 1.06
incidence[9]        0.10       0.00        0.02 NA   119 1.04
kappa[1]      5956088.71 1213800.44 10903430.55 NA    81 1.04
kappa[2]      5293484.58  844236.78  9984647.90 NA   140 1.01
kappa[3]     28376474.95 1399822.89 17900770.95 NA   164 1.03
kappa[4]     19757579.51 1195290.81 17665216.05 NA   218 1.02
kappa[5]      5493912.43  897751.76  9136688.06 NA   104 1.05
kappa[6]      6449579.61  787647.86  9190285.60 NA   136 1.01
kappa[7]      8500608.01 1172611.40 12531313.60 NA   114 1.01
kappa[8]      8901120.91 1698742.67 12644851.22 NA    55 1.06
kappa[9]      7267587.30 1406261.16 12121881.04 NA    74 1.05
kappa_log[1]       14.56       0.18        1.40 NA    60 1.08
kappa_log[2]       14.37       0.13        1.42 NA   122 1.01
kappa_log[3]       16.88       0.07        0.87 NA   167 1.03
kappa_log[4]       16.28       0.08        1.15 NA   210 1.01
kappa_log[5]       14.76       0.11        1.17 NA   119 1.03
kappa_log[6]       15.10       0.08        1.03 NA   153 1.01
kappa_log[7]       15.08       0.17        1.36 NA    61 1.03
kappa_log[8]       15.12       0.22        1.38 NA    41 1.08
kappa_log[9]       14.76       0.17        1.45 NA    70 1.04
meaninc             0.11       0.00        0.01 NA   107 1.02
lp__            -7659.18       1.14       11.63 NA   105 1.01

Samples were drawn using NUTS(diag_e) at Sat Aug 30 14:24:52 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

This model did quite well. I wish all my models worked as well.

Prior: p(alpha,beta) ∝ (alpha + beta)^(-5/2)

model2 <- '
data {
    int<lower=0> n;
    int<lower=0> nregion;
    int count[n];
    int<lower=1,upper=nregion> Region[n];
    int Population[n];
    real scale;
}
parameters {
    vector <lower=0,upper=1> [nregion]  theta;
    vector <lower=1e6> [nregion]  kappa; 
    vector <lower=0,upper=1> [n] p1;
    real <lower=0,upper=1> thetam;
    real <lower=0> thetasd;
}
transformed parameters {
    vector [nregion] a;
    vector [nregion] b;
    for (i in 1:nregion) {
        a[i] <- kappa[i] * theta[i];
        b[i] <- kappa[i] * (1 - theta[i]);
    }
   
}
model {      
    kappa ~ pareto(10^6,1.5);
    for (i in 1:n) {
        p1[i] ~ beta(a[Region[i]],b[Region[i]]);
    }
    count ~ binomial(Population,p1);
    theta ~ normal(thetam,thetasd);
    thetam ~ uniform(0,1);
    thetasd ~ uniform(0,.1);       
}
generated quantities {
    vector [n] pp;
    vector [nregion] incidence;
    real meaninc;
    incidence <- scale*theta;
    pp <- p1 * scale;
    meaninc <- scale*thetam;
}
'
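In Stan, pareto(y_min, alpha) has density alpha * y_min^alpha / y^(alpha+1), so kappa ~ pareto(10^6, 1.5) puts prior mass proportional to kappa^(-5/2) above 10^6, matching the prior in the heading with kappa in the role of alpha + beta. Base R has no Pareto sampler; a minimal inverse-CDF sketch (rpareto is my own helper, not from a package) shows what this prior looks like:

```r
# Inverse-CDF sampler for Stan's pareto(y_min, alpha):
# F(y) = 1 - (y_min / y)^alpha, so y = y_min * u^(-1/alpha), u ~ Unif(0,1)
rpareto <- function(n, y_min, alpha) y_min * runif(n)^(-1 / alpha)

set.seed(42)
kappa <- rpareto(1e5, y_min = 1e6, alpha = 1.5)

min(kappa) >= 1e6    # TRUE: the lower bound is respected
median(kappa) / 1e6  # theoretical median is y_min * 2^(1/1.5), about 1.59
```

Unlike the uniform(12,18) on log(kappa) in model1, this prior is heavy-tailed: most mass sits just above the bound, but very large kappa remains possible.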

inits2 <- function()
    list(theta=rnorm(datain$nregion,1e-6,1e-7),
        thetam = rnorm(1,1e-6,1e-7),
        thetasd = rnorm(1,1e-6,1e-7),
        kappa=runif(datain$nregion,1e6,1e7),
        p1=rnorm(datain$n,1e-7,1e-8))

parameters2 <- c('a','b','incidence','kappa','meaninc')

fits2 <- stan(model_code = model2,
    data = datain,
    pars=parameters2,
    init=inits2)

print(fits2,probs=NA)

Inference for Stan model: model2.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

                    mean   se_mean          sd NA n_eff Rhat
a[1]                3.03      0.06        2.54 NA  1851    1
a[2]                3.36      0.05        2.74 NA  2746    1
a[3]               22.28      1.39       39.52 NA   807    1
a[4]                7.27      0.27       12.39 NA  2173    1
a[5]                3.61      0.08        3.50 NA  2025    1
a[6]                4.38      0.09        3.70 NA  1743    1
a[7]                3.58      0.13        4.14 NA   957    1
a[8]                3.54      0.11        3.60 NA  1125    1
a[9]                3.20      0.10        3.81 NA  1346    1
b[1]          2132903.53  47586.20  2085059.70 NA  1920    1
b[2]          1804395.74  29404.95  1540741.77 NA  2745    1
b[3]         12482567.37 782922.81 22226255.17 NA   806    1
b[4]          4057579.18 146369.50  6818624.58 NA  2170    1
b[5]          2080275.79  44577.27  2020891.35 NA  2055    1
b[6]          2368569.05  48762.30  1922950.01 NA  1555    1
b[7]          2281906.54  81326.60  2583506.11 NA  1009    1
b[8]          2319737.52  71109.80  2408437.85 NA  1147    1
b[9]          2092476.48  61180.23  2313166.69 NA  1430    1
incidence[1]        0.09      0.00        0.02 NA   746    1
incidence[2]        0.12      0.00        0.02 NA  1173    1
incidence[3]        0.11      0.00        0.02 NA  1209    1
incidence[4]        0.11      0.00        0.02 NA  1387    1
incidence[5]        0.11      0.00        0.02 NA  1406    1
incidence[6]        0.12      0.00        0.02 NA  1248    1
incidence[7]        0.10      0.00        0.02 NA   999    1
incidence[8]        0.10      0.00        0.02 NA  1063    1
incidence[9]        0.10      0.00        0.02 NA   935    1
kappa[1]      2132906.56  47586.25  2085062.04 NA  1920    1
kappa[2]      1804399.10  29405.01  1540744.41 NA  2745    1
kappa[3]     12482589.65 782924.19 22226294.41 NA   806    1
kappa[4]      4057586.45 146369.76  6818636.82 NA  2170    1
kappa[5]      2080279.40  44577.35  2020894.78 NA  2055    1
kappa[6]      2368573.43  48762.38  1922953.62 NA  1555    1
kappa[7]      2281910.12  81326.73  2583510.17 NA  1009    1
kappa[8]      2319741.06  71109.91  2408441.36 NA  1147    1
kappa[9]      2092479.69  61180.33  2313170.41 NA  1430    1
meaninc             0.11      0.00        0.01 NA   901    1
lp__            -7881.32      0.42        8.66 NA   428    1

Samples were drawn using NUTS(diag_e) at Sat Aug 30 14:34:49 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
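
Comparing the two outputs, the Pareto prior pays off mainly in n_eff: model2's effective sample sizes are roughly an order of magnitude higher than model1's for the same 4000 post-warmup draws. n_eff discounts draws for autocorrelation; the idea can be illustrated on a self-contained simulated AR(1) chain, using the textbook formula rather than rstan's split-chain estimator:

```r
# Effective sample size via the textbook formula
#   n_eff = n / (1 + 2 * sum of positive autocorrelations).
# Illustration only; rstan uses a (split-chain) variant of this.
set.seed(1)
n   <- 4000
rho <- 0.9                                  # strongly autocorrelated chain
chain <- as.numeric(arima.sim(list(ar = rho), n))

acf_est <- acf(chain, lag.max = 100, plot = FALSE)$acf[-1]  # drop lag 0
pos <- acf_est[cumprod(acf_est > 0) == 1]   # keep the initial positive run
n_eff <- n / (1 + 2 * sum(pos))

round(n_eff)  # far below n; theory gives roughly n*(1-rho)/(1+rho) ~ 210
```

A chain with long autocorrelation, like model1's, carries far less information than its raw length suggests, which is exactly what the low n_eff values in the first output signal.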