Saturday, March 24, 2012

Linking apple liking to sensory

Previously it was seen that apple liking was related to consumer scores for juiciness and sweetness. It would be nice if these scores could in turn be linked to sensory scores. This results in a three-block model:

  • A block with sensory data describing how the apples taste
  • A block with consumer data describing how the apples are perceived by the consumers
  • A block with consumer data describing how the apples are liked
It would be most efficient to use randomForest to establish the next connection. However, I like to try new things in this blog, and repeating the same method would be boring for the reader. So, just to try it out, I decided to use Bayesian Model Averaging.

Juiciness



library(xlsReadWrite)
library(BMA)
library(ggplot2)
# read and prepare the data
datain <- read.xls('condensed.xls')
#remove storage conditions

datain <- datain[-grep('bag|net',datain$Products,ignore.case=TRUE),]
#create week variable from the product name
datain$week <-  sapply(strsplit(as.character(datain$Products),'_'),
    function(x) x[[2]])
#convert into numerical
dataval <- datain
vars <- names(dataval)[-1]
for (descriptor in vars) {
  dataval[,descriptor] <- as.numeric(gsub('[[:alpha:]]','',
          dataval[,descriptor]))
}
#Independent variables are Sensory variables, these all start with S

indepV <- grep('^S',vars,value=TRUE)
xblock <- as.matrix(dataval[,indepV])
bcJ <- bicreg(y=dataval$CJuiciness,x=xblock)


Warning message:
In if ((1 - r2/100) <= 0) stop("a model is perfectly correlated with the response") :
  the condition has length > 1 and only the first element will be used


summary(bcJ)


Call:
bicreg(x = xblock, y = dataval$CJuiciness)


            p!=0    EV         SD        model 1     model 2     model 3     model 4     model 5   
Intercept   100.0   2.9100926  0.403545    3.055344    2.619845    2.913889    2.924000    3.170406
SCrispness   21.9   0.0013282  0.004954       .           .           .           .           .    
SFirmness    20.2  -0.0011815  0.004592       .           .           .           .           .    
SJuiciness  100.0   0.0245677  0.005499    0.026477    0.022773    0.021240    0.025762    0.027249
SMealiness   13.5   0.0002949  0.002505       .           .           .           .           .    
SSweetness   58.4  -0.0050294  0.005644   -0.008973       .           .       -0.006871   -0.009673
SSourness    37.7   0.0014320  0.002701       .        0.004351       .        0.001453       .    
SFlavor      17.0  -0.0004604  0.003566       .           .           .           .       -0.002023
                                                                                                   
nVar                                         2           2           1           3           3     
r2                                         0.728       0.710       0.636       0.731       0.729   
BIC                                      -17.631267  -16.497892  -15.312821  -14.954309  -14.852938
post prob                                  0.198       0.113       0.062       0.052       0.049   
plot(bcJ,mfrow=c(3,3)) 



The models show a link with SJuiciness, which is what was expected. This link is clearly a positive correlation. A second link is to SSweetness, which is probably negative. Alternatively, but less probably, this may be a positive link to SSourness. From a sensory point of view it can go either way, depending on how the scales were constructed for data acquisition. The model does not account for any curvature via quadratic or interaction effects. This is a bit of a disadvantage, but for this modelling step it is not required. (The warning R gave was ignored.)
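As an aside (not in the original post), the BMA package also provides imageplot.bma(), which shows which predictors enter which of the selected models; a quick call on the bcJ object fitted above would be:
#visualize variable inclusion across the selected models
imageplot.bma(bcJ)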

Sweetness

bcS <- bicreg(y=dataval$CSweetness,x=xblock)


Warning message:
In if ((1 - r2/100) <= 0) stop("a model is perfectly correlated with the response") :
  the condition has length > 1 and only the first element will be used

summary(bcS)
Call:
bicreg(x = xblock, y = dataval$CSweetness)


  27  models were selected
 Best  5  models (cumulative posterior probability =  0.5121 ): 

            p!=0    EV         SD        model 1     model 2     model 3     model 4     model 5   
Intercept   100.0   3.6442040  0.518335    3.433920    4.091197    3.165198    3.424928    3.815567
SCrispness   85.7  -0.0077878  0.004981   -0.006778   -0.011925   -0.007570       .       -0.012148
SFirmness    29.7  -0.0012657  0.003688       .           .           .       -0.006709       .    
SJuiciness   16.2  -0.0001717  0.001631       .           .           .           .           .    
SMealiness   41.6  -0.0037273  0.006153       .       -0.009068       .           .       -0.008314
SSweetness  100.0   0.0098838  0.002804    0.010048    0.009443    0.010997    0.009795    0.010274
SSourness    18.2  -0.0001813  0.001224       .           .           .           .           .    
SFlavor      28.1   0.0013050  0.003288       .           .        0.004341       .        0.003569
                                                                                                   
nVar                                         2           3           3           2           4     
r2                                         0.702       0.742       0.723       0.664       0.755   
BIC                                      -16.023858  -15.699168  -14.426420  -13.864248  -13.788552
post prob                                  0.173       0.147       0.078       0.059       0.056   
CSweetness can be explained via SSweetness, as was hoped and expected, plus one or two other variables: SCrispness and SMealiness, both of them probably with a negative sign.
plot(bcS,mfrow=c(3,3)) 


BMA

My final verdict on BMA is mixed. It is nice to get insight into which independent variables are important and what the effects of the other variables are. It is odd to get a warning, but since the results seem to make sense, ignoring it is not too bad an approach. My main dislike is the complete inability to handle interactions and quadratic effects. Even though it is not difficult to extend xblock to contain all of these, the calculation time becomes prohibitive, and BMA does not understand that a model with a quadratic or interaction term also requires the corresponding main effects. For the current modelling step this is not the largest problem.
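For completeness, below is a minimal sketch (not from the original post) of how xblock could be extended with quadratic and interaction columns; the object names xquad, xint and xblockBig are mine. The actual bicreg() call is left commented out because, as noted above, the run time becomes prohibitive and the main-effect hierarchy is not respected.
#squared terms for every sensory variable
xquad <- sapply(indepV, function(v) dataval[,v]^2)
colnames(xquad) <- paste0(indepV,'.sq')
#all pairwise interactions
prs <- combn(indepV,2)
xint <- apply(prs,2,function(p) dataval[,p[1]]*dataval[,p[2]])
colnames(xint) <- paste(prs[1,],prs[2,],sep='.x.')
xblockBig <- cbind(xblock,xquad,xint)
#bcJbig <- bicreg(y=dataval$CJuiciness,x=xblockBig)  # slow, and hierarchy is ignored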

Sunday, March 18, 2012

Liking of apples - more than juiciness

In a previous post it was shown, using literature data, that liking of apples is related to juiciness. However, some questions remained:

  1. Is the relation linear or slightly curved? 
  2. The variation in liking around CJuiciness is large. Are more explanatory variables needed?
  3. So, what drives CJuiciness?
In this post it becomes clear that indeed there is more to liking of apples than just juiciness. 
Error in average scores
The paper of Peneau et al. uses Tukey's post hoc test to examine the differences between the products. The test is performed within each week. First we use the data to determine how large a difference needs to be before the test declares it significant.
library(xlsReadWrite)
library(plyr)
library(ggplot2)
library(locfit) 
#get data as before 

datain <- read.xls('condensed.xls')
#remove storage conditions
datain <- datain[-grep('bag|net',datain$Products,ignore.case=TRUE),]
#create week variable
datain$week <-  sapply(strsplit(as.character(datain$Product),'_'),
   function(x) x[[2]])
#function to extract pairwise numerical differences and significances 
extract.diff <- function(descriptor) {
  diffs1 <- ldply(c('W1','W2'),function(Week) {
        value <- 
            as.numeric(gsub('[[:alpha:]]','',
               datain[,descriptor]))[datain$week==Week]
        sig.dif <-
            gsub('[[:digit:].]','',datain[,descriptor])[datain$week==Week]
        #matrix of all pairwise numerical differences
        dif.mat <- outer(value,value,'-')
        #1 if two products share a Tukey significance letter, 0 otherwise
        sig.mat <- outer(sig.dif,sig.dif,function(X,Y) {
              sapply(1:length(X),function(i) {
                    g1 <- grep(paste0('[',X[i],']'),Y[i])
                    as.numeric(length(g1)>0)
                  })
            })
        data.frame(dif.val = as.vector(dif.mat), dif.sig = 
                as.vector(sig.mat),Week=Week,descriptor=descriptor)
      })
  diffs1
}



likedif <- extract.diff("CLiking")
likedif <- likedif[likedif$dif.val>=0,]
g <- ggplot(likedif, aes(dif.val,dif.sig))
g + geom_jitter(aes(colour=Week),position=position_jitter(height=.05))

The plot shows that a difference between 0.24 and 0.26 is enough to be significant. For juiciness, the pattern is the same:

likedif <- extract.diff("CJuiciness")
likedif <- likedif[likedif$dif.val>=0,]
g <- ggplot(likedif, aes(dif.val,dif.sig))
g + geom_jitter(aes(colour=Week),position=position_jitter(height=.05))
For juiciness a difference of 0.24 is enough to be significant. Given all this, a difference of 0.24 can be used for both variables. 
Liking vs. Juiciness
The plot with error bars is easy to make. For completeness, a local fit is added. 
dataval <- datain
vars <- names(dataval)[-1]
for (descriptor in vars) {
  dataval[,descriptor] <- as.numeric(gsub('[[:alpha:]]','',dataval[,descriptor]))
}
l1 <- locfit(CLiking ~ lp(CJuiciness,nn=1),data=dataval)
topred <- data.frame(CJuiciness=seq(3.6,4.8,.1))
topred$CLiking <- predict(l1,topred)
g <- ggplot(dataval,aes(CJuiciness,CLiking))
g <- g  + geom_point() + geom_errorbar(aes(ymin=CLiking-.24,ymax=CLiking+.24))
g <- g + geom_errorbarh(aes(xmin=CJuiciness-.24,xmax=CJuiciness+.24))
g <- g + geom_line(data=topred,colour='blue')
g
Both the local fit and the error bars suggest that curvature is worth pursuing. On top of that, a linear relation implies that any increase in juiciness is good, while in general an optimum level is expected. Compare it with sugar: if you like two lumps of sugar, one is disliked as not sweet enough, while four is too sweet. Hence, again, every reason to look at curvature. 
Regarding the inclusion of extra variables, the data shows that the products at CJuiciness 4.1 are almost significantly different in liking. Given that this significance test is Tukey's HSD, the point with much lower liking is probably a relevant difference. On the other hand, the data is fairly well described by curvature alone, so one extra explanatory variable should be enough.
Adding an extra explanatory variable
According to the previous calculation, the two prime candidates for a second explanatory variable are CSweetness and CMealiness. In general one would expect juicy apples not to be mealy, so there is a reason to avoid CMealiness. Nevertheless, both are investigated.
l1 <- locfit(CLiking ~ lp(CJuiciness,CSweetness,nn=1.3),data=dataval)
topred <- expand.grid(CJuiciness=seq(3.8,4.6,.1),CSweetness=seq(3.2,3.9,.1))
topred$CLiking <- predict(l1,topred)
v <- ggplot(topred, aes(CJuiciness, CSweetness, z = CLiking)) 
v <- v + stat_contour(aes(colour= ..level..) )
v + geom_point(data=dataval,stat='identity',position='identity',aes(CJuiciness,CSweetness))
l1 <- locfit(CLiking ~ lp(CJuiciness,CMealiness,nn=1.3),data=dataval)
topred <- expand.grid(CJuiciness=seq(3.8,4.6,.1),CMealiness=seq(1.4,2.1,.1))
topred$CLiking <- predict(l1,topred)
v <- ggplot(topred, aes(CJuiciness, CMealiness, z = CLiking)) 
v <- v + stat_contour(aes(colour= ..level..) )
v + geom_point(data=dataval,stat='identity',position='identity',aes(CJuiciness,CMealiness))
The link between CMealiness and CJuiciness is quite strong. It is also clear that CMealiness does not explain the large difference in liking at CJuiciness 4.1. Hence CSweetness is chosen. Not the best of statistical reasons, but all in all it feels like the better model.
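To put a number on that link, a quick correlation check can be added (a small sketch, not part of the original post, using the numeric dataval from above):
cor(dataval[,c('CJuiciness','CMealiness','CSweetness')],use='pairwise.complete.obs')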
Simplified linear model
Finally, even though I like the local model, it is more convenient to use a simple linear model. After removal of non-significant terms, only three terms are left: CJuiciness, CJuiciness^2 and CSweetness.
l1 <- lm(CLiking ~ CJuiciness*CSweetness + I(CJuiciness^2) + I(CSweetness^2),data=dataval)
summary(l1)
Call:
lm(formula = CLiking ~ CJuiciness * CSweetness + I(CJuiciness^2) + 
    I(CSweetness^2), data = dataval)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.12451 -0.02432  0.01002  0.02591  0.06914 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)  
(Intercept)            -6.5326    19.8267  -0.329   0.7475  
CJuiciness              6.7662     4.0414   1.674   0.1199  
CSweetness             -2.5410     8.7807  -0.289   0.7772  
I(CJuiciness^2)        -0.8638     0.3564  -2.424   0.0321 *
I(CSweetness^2)         0.1264     0.9538   0.133   0.8968  
CJuiciness:CSweetness   0.3321     0.6574   0.505   0.6225  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.05827 on 12 degrees of freedom
Multiple R-squared: 0.9107, Adjusted R-squared: 0.8735 
F-statistic: 24.48 on 5 and 12 DF,  p-value: 6.589e-06 
l1 <- lm(CLiking ~ CJuiciness+CSweetness + I(CJuiciness^2) ,data=dataval)
summary(l1)
Call:
lm(formula = CLiking ~ CJuiciness + CSweetness + I(CJuiciness^2), 
    data = dataval)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.125337 -0.023834  0.004955  0.024922  0.087851 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)     -14.3609     5.1169  -2.807  0.01400 * 
CJuiciness        8.5997     2.3715   3.626  0.00275 **
CSweetness       -0.2683     0.1104  -2.430  0.02916 * 
I(CJuiciness^2)  -0.9428     0.2795  -3.373  0.00455 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.0547 on 14 degrees of freedom
Multiple R-squared: 0.9082, Adjusted R-squared: 0.8885 
F-statistic: 46.17 on 3 and 14 DF,  p-value: 1.654e-07 
topred <- expand.grid(CJuiciness=seq(3.8,4.6,.05),CSweetness=seq(3.2,3.9,.05))
topred$CLiking <- predict(l1,topred)
v <- ggplot(topred, aes(CJuiciness, CSweetness, z = CLiking))
v <- v + stat_contour(aes(colour= ..level..) )
v + geom_point(data=dataval,stat='identity',position='identity',aes(CJuiciness,CSweetness))
The resulting plot shows quite some difference from the local fit, but mostly in the regions without data. 
Hence it is concluded that liking of apples depends mainly on juiciness, and somewhat on sweetness. Above a certain juiciness there is no further gain to be made, and lower sweetness gives better liking.
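That 'certain juiciness' can be read off from the reduced model; a quick back-of-the-envelope sketch (not in the original post) takes the top of the fitted parabola at -b/(2a):
cf <- coef(l1)
-cf['CJuiciness']/(2*cf['I(CJuiciness^2)'])
#roughly 4.6 on the consumer juiciness scale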
Discussion
It is a bit disappointing that the error in CJuiciness and CSweetness is not incorporated in the model. Unfortunately, this is easier said than done. The keyword here is Total Least Squares, also named Deming regression. Unfortunately, this is only viable when two variables are regressed, and curvature is also outside its scope. 
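For the straight-line, two-variable case it is simple enough to illustrate; below is a minimal sketch of Deming (orthogonal) regression of CLiking on CJuiciness, assuming equal error variances in both scores (the function name deming2 and its delta argument are mine, not from the original post):
deming2 <- function(x,y,delta=1) {
  #delta = ratio of the error variances (y relative to x); delta=1 is orthogonal regression
  sxx <- var(x); syy <- var(y); sxy <- cov(x,y)
  slope <- (syy - delta*sxx + sqrt((syy - delta*sxx)^2 + 4*delta*sxy^2))/(2*sxy)
  c(intercept = mean(y) - slope*mean(x), slope = slope)
}
d <- na.omit(dataval[,c('CJuiciness','CLiking')])
deming2(d$CJuiciness,d$CLiking)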
In addition, this leads to the question: what is error? The 'error' in the scores consists of different parts: differences between slices of apple, differences in sensory perception, and differences in the way people score. With regard to the model, differences between apple slices and in sensory perception are counted in liking, while scoring error is not. Ideally, these would be split. That is rather a tall order. 
Hence, two questions remain:

  1. what drives CJuiciness?
  2. what drives CSweetness?

Thursday, March 15, 2012

Liking of apples - some data to link

I browsed through a paper by Peneau et al. (J. Sensory Studies, 2007) which contains nice data on apples: consumer evaluation, sensory evaluation and instrumental measurements. I think these are interesting data for examining whether such variable blocks can be linked. This linking is a big thing in sensory science. In this post it is shown that the consumers' evaluation of juiciness is the main determining factor for liking (the driver of liking).
Data
The data is given in three tables, containing averages over storage conditions for six cultivars at two storage times. Three cultivars were replicated. Since no cultivar*storage condition data is available, I will ignore the storage condition. Significant differences were indicated in the data tables; I added these when entering the data. The top left part of the data table:

library(xlsReadWrite)
datain <- read.xls('condensed.xls')
datain[1:5,1:5]
      Products CLiking CFreshness CCrispness CJuiciness
1     Ariwa_W1  4.19ab      4.25a     4.39ab     4.14cd
2    Elstar_W1   4.25a     4.01ab      3.84d    4.32bcd
3  Jonagold_W1   4.31a      4.14a     4.35ab      4.56a
4      Gala_W1  4.19ab      4.08a     4.24bc   4.36abcd
5     Topaz_W1   4.35a      4.11a      4.59b    4.37abc
In this table, the final part of the product name is the storage duration. The first character of each variable name indicates the source: 'C' indicates consumer data, 'S' sensory data and 'A' analytical chemical data. To make the data ready, the storage conditions (bag/net) and the significance letters are removed.
datain <- datain[-grep('bag|net',datain$Products,ignore.case=TRUE),]
#convert strings into numbers
vars <- names(datain)[-1]
for (descriptor in vars) {
    datain[,descriptor] <- as.numeric(gsub('[[:alpha:]]','',datain[,descriptor]))
}
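As an aside (not in the original post), this naming convention makes it easy to pick out the variable blocks later on:
#select variable blocks by their prefix
consumerVars <- grep('^C',names(datain),value=TRUE)
sensoryVars  <- grep('^S',names(datain),value=TRUE)
chemVars     <- grep('^A',names(datain),value=TRUE)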
Main driver of liking
Random forests are my preferred way to get a quick view of the most important effects. They do not worry about having more variables than observations and do not assume a linear relation.
library(randomForest)
#remove missing data and names
data2 <- datain[-1,-1]
rf1 <- randomForest(CLiking ~ .,data=data2,importance=TRUE)
varImpPlot(rf1)
The plot shows that CJuiciness (the consumer score for juiciness) is the main driver of liking. Indeed, the effect is clear when plotting CLiking against CJuiciness.


plot(CLiking ~ CJuiciness,data=datain)

The plot gives rise to three questions:

  1. Is the relation linear or slightly curved? 
  2. The variation in liking around CJuiciness is large. Are more explanatory variables needed?
  3. So, what drives CJuiciness?
More on these data in a next post.

Monday, March 12, 2012

R index between two products is somewhat dependent on other products

I explained earlier how the R-index is used in sensory science to examine ranking data. The legitimization for using the R-index lies in its link with d' and with the Mann-Whitney statistic. In this post I show that the R-index between two products depends on the number of products and on the position of the other products. It is a small effect. However, if data is analyzed by looking only and rigidly at the p value, the result might change from just under significant to just over.

Using simulations, I will show that the presence of other samples influences the R-index. I think this effect occurs because the R-index is, mathematically, calculated from an aggregated matrix of counts of products against ranks. My feeling is that when there are more products, there are fewer chances to get equal rankings than with few products, and hence slightly different scores.
R index calculation
Below is the calculation when comparing 2 products out of a total of 4.
R index calculation matrix
               rank 1 rank 2 rank 3 rank 4
    product 1       a      b      c      d
    product 2       e      f      g      h
          Note: a to h are the counts in the respective cells. 
The R index is composed of three parts:
1 The number of wins of product 1 over product 2:
   a*(f+g+h) + b*(g+h) + c*h
2 The number of equal rankings, divided by two:
   (a*e + b*f + c*g + d*h) / 2
3 Normalization:
   (a+b+c+d)*(e+f+g+h)
R index = 100 * (wins + equal) / normalization
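A small worked example with made-up counts (purely illustrative, not from the post) shows how the parts combine; the tab2Rindex() function defined further down gives the same value.
t1 <- c(5,8,7,5)   # counts a, b, c, d for product 1 over ranks 1..4
t2 <- c(10,6,5,4)  # counts e, f, g, h for product 2
wins  <- t1[1]*sum(t2[2:4]) + t1[2]*sum(t2[3:4]) + t1[3]*t2[4]
equal <- sum(t1*t2)/2
100*(wins + equal)/(sum(t1)*sum(t2))   # about 40.2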
Effect of number of products
Figure 1 shows the simulated dependence of the R-index on the number of products, using rankings by 25 panelists. With a low number of products, the distribution of the R-index is a bit wider than with more products. Most of the difference in distribution is in the region of 3 to 6 products, which is also the number of products often used in sensory studies.
(Critical values of R-indices are given by the red and blue lines (Bi and O'Mahony, 1995 and 2007 respectively, Journal of Sensory Studies).)
Effect of neighborhood of other products
Figure 2 shows the dependence on the location of the other products. I have chosen 5 products; two have the same location, and the other 3 move away from this location. Again 25 panelists. The figure shows that the R-index between the two equal products has a narrower distribution under H0 (no product differences) when all products are similar. This is about the same as for 5 products in the first plot. When the other products are far away, the distribution becomes wider, getting closer to the 3-product distribution of figure 1.
It should be noted that with one product rather than three moving away from the centre location, the effect is smaller. The effect of the number of panelists is for a next post.
Code for figure 1:


library(ggplot2)


makeRanksNoDiff <- function(nprod,nrep) {
  inList <- lapply(1:nrep,function(x) sample(1:nprod,nprod)   )
  data.frame(person=factor(rep(1:nrep,each=nprod)),
      prod=factor(rep(1:nprod,times=nrep)),
      rank=unlist(inList))
}


tab2Rindex <- function(t1,t2) {
  #t1, t2: counts per rank for the two products
  #first term: wins of product 1 over product 2; second term: ties divided by two
  Rindex <- crossprod(rev(t1)[-1],cumsum(rev(t2[-1]))) + 0.5*crossprod(t1,t2)
  100*Rindex/(sum(t1)*sum(t2))
}


FastAllRindex <- function(rankExperiment) {
  crst <- xtabs(~ prod + rank,data=rankExperiment)
  nprod <- nlevels(rankExperiment$prod)
  Rindices <- unlist(   lapply(1:(nprod-1),function(p1) {
            lapply((p1+1):nprod,function(p2) tab2Rindex(crst[p1,],crst[p2,])) }) )
  Rindices   
}


nprod <- seq(3,25,by=1)
last <- lapply(nprod,function(xo) {
      nsamples <- ceiling(10000/xo)
      li <- lapply(1:nsamples,function(xi) {
            re <- makeRanksNoDiff(nprod=xo,nrep=25)
            FastAllRindex(re)   
          })
      li2 <- as.data.frame(do.call(rbind,li))
      li2$nprod <- xo
      li2
    } )   


last2 <- lapply(last,function(x) {
      qq <- quantile(as.matrix(x[,grep('nprod',names(x),invert=TRUE)]) ,c(0.025,.5,.975))
      qq <- as.data.frame(t(qq))
      qq$nprod <- x$nprod[1]
      qq
    }   )


summy <- do.call(rbind,last2)
g1 <- ggplot(summy,aes(nprod,`50%`) )
g1 <- g1+ geom_errorbar(aes(ymax = `97.5%`, ymin=`2.5%`))
g1 <- g1 + scale_y_continuous(name='R-index' )
g1 <- g1 + scale_x_continuous(name='Number of products to compare')
g1 <- g1 + geom_hline(yintercept=50 + 18.57*c(-1,1),colour='red')
g1 <- g1 + geom_hline(yintercept=50 + 15.21*c(-1,1),colour='blue')


g1


Additional code for figure 2
makeRanksDiff <- function(prods,nrep) {
  nprod <- length(prods)
  inList <- lapply(1:nrep,function(x)  rank(rnorm(n=nprod,mean=prods)))
  data.frame(person=factor(rep(1:nrep,each=nprod)),
      prod=factor(rep(1:nprod,times=nrep)),
      rank=unlist(inList))
}

location <- seq(0,3,by=.25)
last <- lapply(location,function(xo) {
      li <- sapply(1:10000,function(xi) {
            re <- makeRanksDiff(prods=c(0,0,xo,xo,xo),nrep=25)
            crst <- xtabs(~ prod + rank,data=re)
            tab2Rindex(crst[1,],crst[2,])
          })
      li2 <- data.frame(location=xo,Rindex=li)
      li2
    } )

last2 <- lapply(last,function(x) {
      qq <- quantile( x$Rindex,c(0.025,.5,.975))
      qq <- as.data.frame(t(qq))
      qq$location <- x$location[1]
      qq
    }   )

summy <- do.call(rbind,last2)
g1 <- ggplot(summy,aes(location,`50%`) )
g1 <- g1+ geom_errorbar(aes(ymax = `97.5%`, ymin=`2.5%`))
g1 <- g1 + scale_y_continuous(name='R-index between equal products' )
g1 <- g1 + scale_x_continuous(name='Location of odd products')
g1 <- g1 + geom_hline(yintercept=50 + 18.57*c(-1,1),colour='red')
g1 <- g1 + geom_hline(yintercept=50 + 15.21*c(-1,1),colour='blue')

g1




Saturday, March 10, 2012

Detour in taste wordclouds

I read Mining Twitter for consumer attitudes towards hotels in my R-bloggers feed. That reminded me that I intended to look at generating wordclouds for salt and MSG at some point. Salt, or sodium, is linked to hypertension, which is linked to a number of diseases (http://en.wikipedia.org/wiki/Complications_of_hypertension). It is a topic within governments and health organizations, but I have the feeling it is not so much an issue with the public. MSG, or monosodium glutamate, is not an issue for governments or health organisations, but it has a bad name and is for some people linked to Chinese restaurant syndrome. Luckily there was a nice post to follow: Generating Twitter Wordclouds in R.
Salt
Neither @Salt nor #Salt are useful when one is interested in salt taste. Hence the search is for #sodium.

library(twitteR)
library(plyr)
sodium.tweets <- searchTwitter('#sodium',n=1500)
sodium.texts <- laply(sodium.tweets, function(x) x$getText())
head(sodium.texts)
[1] "#Citric Acid #Sodium Bicarbonate http://t.co/QgJxSlGT HealthAid Vitamin C 1000mg - Effervescent (Blackcurrant Flavour) - 20 Tablets"
[2] "I dnt understand metro I can go on Facebook an Twitter but I can't call or text anybody #sodium"                                    
[3] "Get the facts on #sodium:http://t.co/Djc9rTEl #BCHC @TheHSF"                                                                        
[4] "#Sodium: How to tame your salt habit now? http://t.co/eFTl8yI1"                                                                     
[5] "#lol #funny #insta #instafunny #haha #smile #meme #chemistry #joke #sodium  http://t.co/pX404RhQ"                                   
[6] "@Astroboii07 #sodium. Haha. Tas bisaya daw. i-sudyum. Hahaha.  @andiedote @krizhsexy @mjpatingo  #building"                         
At this point I found the blog post 'twitter to wordcloud', so I restarted and used those functions. The original is from 'Using Text Mining to Find Out What @RDataMining Tweets are About'. There was a small bit of editing: require(tm) and require(wordcloud) within the functions did not work, so I load the libraries directly. The clouds also had some links in them, shown as 'httpt' with some more text appended (a link to a chemistry joke), so a function to remove those is added too.
library(tm)
library(wordcloud)

RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}

RemoveHTTP <- function(tweet) {
  gsub("http[[:alnum:][:punct:]]+", "", tweet)
}

generateCorpus <- function(df, my.stopwords=c()) {
  #The following is cribbed and seems to do what it says on the can
  tw.corpus <- Corpus(VectorSource(df))
  # remove punctuation
  tw.corpus <- tm_map(tw.corpus, removePunctuation)
  # normalise case
  tw.corpus <- tm_map(tw.corpus, tolower)
  # remove stopwords
  tw.corpus <- tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus <- tm_map(tw.corpus, removeWords, my.stopwords)
  tw.corpus
}

wordcloud.generate <- function(corpus, min.freq=3) {
  doc.m <- TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm <- as.matrix(doc.m)
  # calculate the frequency of words
  v <- sort(rowSums(dm), decreasing=TRUE)
  d <- data.frame(word=names(v), freq=v)
  # generate the wordcloud
  wc <- wordcloud(d$word, d$freq, min.freq=min.freq)
  wc
}

tweets.grabber <- function(searchTerm, num=500) {
  rdmTweets <- searchTwitter(searchTerm, n=num, .encoding='UTF-8')
  tw.df <- twListToDF(rdmTweets)
  tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))
  as.vector(sapply(tweets, RemoveHTTP))
}



tweets=tweets.grabber('sodium',num=500)
tweets <- tweets[-308] # tweet in wrong locale
wordcloud.generate(generateCorpus(tweets,'sodium'),3)
The ugly line which removes tweet 308 is there because this tweet is in the wrong locale and gave an error. This error is not simple to resolve (see: R tm package invalid input in 'utf8towcs'), so I removed the offending tweet.
Error in FUN(X[[308L]], ...) : 
  invalid input 'That was too much sodium 😞' in 'utf8towcs'
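A less brittle alternative to hard-coding index 308 (a sketch, assuming the tweets vector from above) is to drop every tweet that cannot be converted cleanly:
#remove tweets with characters that cannot be converted (e.g. emoji from another locale);
#note this also drops tweets with legitimate accented characters
ok <- !is.na(iconv(tweets, from='UTF-8', to='ASCII'))
tweets <- tweets[ok]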



From the cloud we learn that even within the sodium tweets, fat is as important as salt and linked to it. Using grep('fat',tweets,value=TRUE,ignore.case=TRUE):
"I'm higher then a fat bitch sodium"
"Beware of Salt! How to protect your health and shrink your fat. "
'Low' brings up some advertisements and some behaviour:
"Mothers Quick Cooking Barley, 11-Ounce Unit (Pack of 12): Quick cooking. Good source of fiber. Low fat, sodium f... "
"Chef's Pride Beef Flavored Base, Low Sodium, 16-Ounce Tubs (Pack of 12):"

"I cut alcohol for about two months &amp; started eating only natural/organic food. No processes junk & low sodium! I cook for myself"  
'Blood' has both positive and negative tweets.

"RT : Reduce blood pressure by paying attention to the sodium content in the food you buy. The salt you add is minimal in comparison."
" this is corned beef season!!!! I'm ready for the mass amounts of sodium and the alarming spike in my blood pressure"  
'Pressure', even though not large in the cloud, is also mixed up with lighting:


"Hydroponic Indoor Grow Light Bulb Lamp - 1000 Watt High Output HPS - High Pressure Sodium:  "                                        


"RT : Obesity, a high-salt, high-fat diet, and lack of regular exercise can all amp up the blood pressure. "                          
MSG
MSG is another word which cannot be used directly in a Twitter search, since it is also an abbreviation of 'message'. Hence the search is for glutamate. The words msg and monosodium needed to be removed from the feed, in addition to glutamate.

tweetsMSG=tweets.grabber('glutamate',num=1500)
tweetsMSG <- tweetsMSG[-591]
wordcloud.generate(generateCorpus(tweetsMSG,c('glutamate','msg','monosodium')),3)
Stress is a bit of a surprise to me: "Loss of glutamate receptor linked to negative effects of chronic stress".
The negative words are much smaller, related to the story that glutamate is an excitotoxin which passes the blood-brain barrier, with all the negative effects of such. Surprisingly, there is also a positive tweet in this context.
"How the Excitotoxins Glutamate and Aspartame Affect Our Health" 
"Glial cells may protect against excitotoxicity by transporting excess glutamate across the blood-brain barrier ". 
It is also claimed that glutamate is hidden:
"Hidden names for MSG - Hydrolyzed protein, glutamate, hydrolyzed soy, yeast extract, caseinate, spices, natural flavorings, vinegar powder". 
I feel sure these all had links which were stripped from the texts. It is clear there is an active group of people looking at MSG. To repeat, this concern does not come so much from health authorities; check, for instance, the WHO.
NationalNutritionMonth

NationalNutritionMonth is a hashtag I encountered while examining the tweets. What do we learn about it?

tweetsNNM=tweets.grabber('#NationalNutritionMonth',num=1500)
wordcloud.generate(generateCorpus(tweetsNNM,'nationalnutritionmonth'),3)
From the cloud I understand that March is the National Nutrition Month. It is about healthy eating, nutrition and eating the right things. There is nothing about MSG and only some tweets about salt: "RT . It's #NationalNutritionMonth: how much fat, salt, and sugar should you eat?" Clearly this is pushed by health authorities, but that is obvious from the word NationalNutritionMonth already.