Sunday, June 29, 2014

stone flakes IV

In this post I want to try something new: a causal graphical model. The aim here is just as much to get a feel for what these things do as to understand how the stone flakes data fit together.

Data

The data are the stone flakes data which I analyzed previously. The first post was clustering, the second linked to hominid type, and the third regression. Together these made for the bulk of a standard analysis. This new analysis uses the same starting data.
r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')
r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')
r12 <- merge(r1,r2)

Packages

The main package used is pcalg (Methods for graphical models and causal inference). Even though it lives on CRAN, it requires RBGL (An interface to the BOOST graph library), which lives on Bioconductor. Plots are made via Rgraphviz (Provides plotting capabilities for R graph objects), Bioconductor again, which in turn leaves the hard work to graphviz; on my Linux machine that is a few clicks to install.
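For reference, a sketch of the one-time installation, assuming the current BiocManager route (at the time of writing this went via biocLite):
# helper for installing Bioconductor packages
install.packages('BiocManager')
# Bioconductor dependencies (graphviz itself comes from the OS)
BiocManager::install(c('RBGL','Rgraphviz'))
# pcalg itself lives on CRAN
install.packages('pcalg')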
library('pcalg')
library('Rgraphviz')


First analysis

I am just following the vignette here, to get some working code.
rx <- subset(r12,,names(r2)[-1])
rx <- rx[complete.cases(rx) & !(r12$ID %in% c('ms','c','roe','sz','va','arn')),]
suffStat <- list(C = cor(rx), n = nrow(rx))
pc.gmG <- pc(suffStat, indepTest = gaussCItest,
    p = ncol(rx), alpha = 0.01)
png('graph1.png')
plot(pc.gmG, main = "")
dev.off()


Personally I dislike this plot, since you have to know which variable is which number. I don't think that is acceptable for things one wants to share. Since I could not find documentation on how to modify this via the plot statement, I took the ugly road of directly modifying the S4 object pc.gmG.
pc.gmG@graph@nodes <- names(rx)
names(pc.gmG@graph@edgeL) <- names(rx)
png('graph2.png')
plot(pc.gmG, main = "")
dev.off()
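As an aside, newer versions of pcalg accept variable names directly, which avoids poking around in the S4 object; a sketch, assuming the labels argument is available in the installed version:
# same fit, but with named nodes from the start
pc.gmG2 <- pc(suffStat, indepTest = gaussCItest,
    labels = names(rx), alpha = 0.01)
plot(pc.gmG2, main = "")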

This makes some sense looking at the variable names.
RTI (Relative-thickness index of the striking platform) is connected to WDI (Width-depth index of the striking platform). PSF (Platform primary (yes/no, relative frequency)) is related to FSF (Platform facetted (yes/no, relative frequency)). PSF is also related to PROZD (Proportion of worked dorsal surface (continuous)), which then goes to ZDF1 (Dorsal surface totally worked (yes/no, relative frequency)). ZDF1 is also influenced by FLA (Flaking angle (the angle between the striking platform and the splitting surface)).

Second analysis

Much as I like this analysis, it does not lead to a connection between the flakes on one hand and age or group on the other. Since the algorithm assumes normally distributed variables, group is out of the question. Log(-age) seems to be closest to normally distributed.
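That choice can be eyeballed with normal Q-Q plots; a quick sketch, using the merged data read above:
# compare -age and log(-age) against normal quantiles
par(mfrow=c(1,2))
qqnorm(-r12$age, main='-age'); qqline(-r12$age)
qqnorm(log(-r12$age), main='log(-age)'); qqline(log(-r12$age))
par(mfrow=c(1,1))
With log(-age) added as variable lmage, the analysis is rerun: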
rx <- subset(r12,,names(r2)[-1])
rx$lmage <- log(-r12$age)
rx <- rx[complete.cases(rx) & !(r12$ID %in% c('ms','c','roe','sz','va','arn')),]
suffStat <- list(C = cor(rx), n = nrow(rx))
pc.gmG <- pc(suffStat, indepTest = gaussCItest,
    p = ncol(rx), alpha = 0.01)
pc.gmG@graph@nodes <- names(rx)
names(pc.gmG@graph@edgeL) <- names(rx)
plot(pc.gmG, main = "")


Adding age links the two parts, while keeping most of the previous graph unchanged. The causal link, however, seems reversed: does age cause change in flakes, or do changes in flakes cause age? Nevertheless, it does show a different picture than before. In linear regression FSF and LBI contributed, but there I had not removed the outliers. In this approach FSF features, but is in turn driven by PSF. The other direct influence is ZDF1, which is now also driven by WDI.

Third analysis

It is probably pushing the limits of what is normally distributed, but there are two binary variables: stone material (1=flint, 2=other) and site (1=gravel pit, 0=other).
rx <- subset(r12,,c('site','mat',names(r2)[-1]))
rx$lmage <- log(-r12$age)
rx <- rx[complete.cases(rx) & !(r12$ID %in% c('ms','c','roe','sz','va','arn')),]
suffStat <- list(C = cor(rx), n = nrow(rx))
pc.gmG <- pc(suffStat, indepTest = gaussCItest,
    p = ncol(rx), alpha = 0.01)
pc.gmG@graph@nodes <- names(rx)
names(pc.gmG@graph@edgeL) <- names(rx)
plot(pc.gmG, main = "")

This makes a link from material to RTI (relative thickness) and a connection from site to FLA (flaking angle), see below.
# site (1=gravel pit, 0=other)
boxplot(FLA ~ c('other','gravel pit')[site+1],
    data=r12,
    ylab='FLA',
    xlab='site')

#stone material (1=flint, 2=other)
boxplot(RTI ~ c('flint','other')[mat],
    data=r12,
    ylab='RTI',
    xlab='mat')
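The same two links can be quantified with plain t-tests, since the correlation between a binary and a continuous variable is effectively point-biserial; a minimal sketch:
# does FLA differ between gravel pit and other sites?
t.test(FLA ~ site, data=r12)
# does RTI differ between flint and other material?
t.test(RTI ~ mat, data=r12)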

Sunday, June 22, 2014

stone flakes III

Stone flakes are waste products from the tool making process in the stone age. This is the third post; the first post was clustering, the second linking to hominid type. The data also contain a more or less continuous age variable, which gives the possibility to use regression, the topic of this week.

Data

Data source: see the first post. Regarding age especially: 'in millennia (not to be taken too seriously) ... mode of dating (geological=more accurate, typological)'. On inspection, it seems most of the typological dates are 200k, hence it seemed sensible to ignore the typological dates. In addition, age is negative. Since at some point I want to use a Box-Cox transformation, for which positive values are required, -age is added.
r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')
r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')
r12 <- merge(r1,r2)
r12$Group <- factor(r12$group,labels=c('Lower Paleolithic',
        'Levallois technique',
        'Middle Paleolithic',
        'Homo Sapiens'))
r12$mAge <- -r12$age
r12c <- r12[complete.cases(r12),]
table(r12$age, r1$dating)
       geo typo
  -400   2    0
  -300   8    1
  -200  13   12
  -130   1    0
  -120  12    0
  -80   23    2
  -40    3    0

Regression

The tool to start with is linear regression. This shows there is a relation, mostly of age with FSF and LBI. It is not a very good model; the residual standard error is at least 55 thousand years.
l1 <- lm(age ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12c[r12c$dating=='geo',])
summary(l1)
Call:
lm(formula = age ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + 
    PROZD, data = r12c[r12c$dating == "geo", ])

Residuals:
     Min       1Q   Median       3Q      Max 
-127.446  -30.646   -7.889   27.790  159.471 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -328.8260   310.6756  -1.058   0.2953  
LBI          149.1334    65.8323   2.265   0.0281 *
RTI           -0.8196     2.8498  -0.288   0.7749  
WDI           16.2067    20.1351   0.805   0.4249  
FLA           -1.6769     1.8680  -0.898   0.3739  
PSF           -0.9222     1.0706  -0.861   0.3934  
FSF            1.9496     0.8290   2.352   0.0229 *
ZDF1           0.9537     1.1915   0.800   0.4275  
PROZD          1.2245     2.0068   0.610   0.5447  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.09 on 47 degrees of freedom
Multiple R-squared:  0.7062, Adjusted R-squared:  0.6562 
F-statistic: 14.12 on 8 and 47 DF,  p-value: 3.23e-10
Some model validation can be made through the standard diagnostic plots; the car package, loaded here, is also needed for the Box-Cox transformation below. The plots give the impression that the error increases with age. It is not a particularly strong effect, but then, the age range is not that large either: 40 to 400, a factor 10.
library(car)
par(mfrow=c(2,2))
plot(l1,ask=FALSE)
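To attach a number to that impression, the car package has a score test for non-constant error variance; a minimal sketch, reusing the model above:
# tests whether error variance changes with the fitted values
ncvTest(l1)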
Since I know of no theoretical basis to choose a transformation, Box-Cox is my method of choice to proceed. Lambda zero, i.e. the log transformation, certainly seems a reasonable choice, hence it has been selected.
r12cx <- r12c[r12c$dating=='geo',]
summary(p1 <- powerTransform(
        mAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD, 
        r12cx))
bcPower Transformation to Normality 

   Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
Y1    0.1211   0.1839          -0.2394           0.4816

Likelihood ratio tests about transformation parameters
                             LRT df         pval
LR test, lambda = (0)  0.4343012  1 5.098859e-01
LR test, lambda = (1) 21.3406407  1 3.844933e-06

Linear regression, step 2

Having chosen a transformation, it is time to rerun the model. It is now clear that FSF is the most important variable, with LBI somewhat less so.
r12cx$lAge <- log(-r12cx$age)
l1 <- lm(lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12cx)
summary(l1)
Call:
lm(formula = lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + 
    PROZD, data = r12cx)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.04512 -0.18333  0.07013  0.21085  0.58648 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  5.071224   2.030941   2.497  0.01609 * 
LBI         -0.953462   0.430357  -2.216  0.03161 * 
RTI          0.005561   0.018630   0.298  0.76664   
WDI         -0.041576   0.131627  -0.316  0.75350   
FLA          0.018477   0.012211   1.513  0.13695   
PSF          0.002753   0.006999   0.393  0.69580   
FSF         -0.015956   0.005419  -2.944  0.00502 **
ZDF1        -0.004485   0.007789  -0.576  0.56748   
PROZD       -0.009941   0.013119  -0.758  0.45236   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3601 on 47 degrees of freedom
Multiple R-squared:  0.6882, Adjusted R-squared:  0.6351 
F-statistic: 12.97 on 8 and 47 DF,  p-value: 1.217e-09

Plot of regression

With two independent variables, it is easy to make a nice plot:
l2 <- lm(lAge ~ LBI + FSF ,
    data=r12cx)
par(mfrow=c(1,1))
incont <- list(x=seq(min(r12cx$LBI),max(r12cx$LBI),length.out=12),
    y=seq(min(r12cx$FSF),max(r12cx$FSF),length.out=13))
topred <- expand.grid(LBI=incont$x,
    FSF=incont$y)
topred$p1 <- predict(l2,topred)        
incont$z <- matrix(-exp(topred$p1),nrow=length(incont$x))
contour(incont,xlab='LBI',ylab='FSF')
cols <- colorRampPalette(c('violet','gold','seagreen'))(4) 
with(r12cx,text(x=LBI,y=FSF,ID,col=cols[group]))

Predictions

I started in data analysis as a chemometrician; my method of choice for a predictive model with correlated independent variables is PLS. In this case one component seems enough (lowest cross-validated RMSEP), but the model explains about 65% of log(-age) variability, which is not impressive.
library(pls)
p1 <- mvr(lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12cx,
    method='simpls',
    validation='LOO',
    scale=TRUE,
    ncomp=5)
summary(p1)
Data: X dimension: 56 8 
Y dimension: 56 1
Fit method: simpls
Number of components considered: 5

VALIDATION: RMSEP
Cross-validated using 56 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV          0.6016   0.3714   0.3812   0.3864   0.3915   0.3983
adjCV       0.6016   0.3713   0.3808   0.3859   0.3910   0.3976

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps
X       51.07    64.11    76.81    83.13    89.44
lAge    64.62    67.94    68.70    68.78    68.81
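The cross-validated RMSEP path can also be inspected graphically; a small sketch using pls's validation plot:
# RMSEP per number of components, from the leave-one-out validation
validationplot(p1, val.type='RMSEP')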
A plot shows there are a number of odd points, which are removed in the next section.
r12c$plspred <- -exp(predict(p1,r12c,ncomp=1))
plot(plspred ~age,ylab='PLS prediction',type='n',data=r12c)
text(x=r12c$age,y=r12c$plspred,r12c$ID,col=cols[r12c$group])
In an email from Thomas Weber, it was also indicated that there might be reasons to doubt the hominid group of a few inventories (rows); reasons include few flakes, changing insight and misfit from an "impressionist technological" point of view. All inventories mentioned in that email are now removed. As can be seen, a two-component PLS model is now preferred, and 80% of variance is explained.
p2 <- mvr(lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12cx,
    method='simpls',
    validation='LOO',
    scale=TRUE,
    ncomp=5,
    subset= !(ID %in% c('ms','c','roe','sz','va','arn')))
summary(p2)

Data: X dimension: 53 8 
Y dimension: 53 1
Fit method: simpls
Number of components considered: 5

VALIDATION: RMSEP
Cross-validated using 53 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV          0.6088   0.3127   0.3057   0.3062   0.3070   0.3153
adjCV       0.6088   0.3126   0.3054   0.3059   0.3065   0.3140

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps
X       52.14    63.94    77.05    79.82    84.77
lAge    75.44    79.66    80.31    81.00    81.56

Plot 

The plot below shows age in the data versus the model predictions. It should be noted that, after all these steps, the data used to fit the model can be expected to be predicted better than the other data. This is especially true for the suspected outliers which were removed, but also for inventories with dating method typological.
Having said that, the model really does not find inventories v1 and v2 to be as old as the data states. Perhaps there was no, less, or different technological change, which is not picked up by the model. In addition, the step between Middle Paleolithic and Homo Sapiens is not picked up by the model. It is my personal suspicion that getting more Homo Sapiens data would improve all of the models which I have made in these three blog posts. As it is, the classification tree was the best tool to detect these inventories, which is a bit odd.
r12c$plspred2 <- -exp(predict(p2,r12c,ncomp=2))
plot(plspred2 ~age,ylab='PLS prediction',type='n',data=r12c)
text(x=r12c$age,y=r12c$plspred2,r12c$ID,col=cols[r12c$group])

Monday, June 16, 2014

Stone flakes II

Continuing from last week, the aim is now to classify the stone flakes based on their various properties. Three methods are used: LDA as an obvious standard, a classification tree as both simple and visually appealing, and random forest as a complex method where more complex relations can easily be captured. Surprising with these data is that the classification tree does better than random forest with respect to predicting the input data.

Data

This is the same as last week. However, I now opted to make the group labels a bit shorter.
r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')
r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')
r12 <- merge(r1,r2)
r12$Group <- factor(r12$group,labels=c('Lower Paleolithic',
    'Levallois technique',
    'Middle Paleolithic',
    'Homo Sapiens'))
r12c <- r12[complete.cases(r12),]

LDA

LDA is part of the MASS package. The settings are default. The plot function has been adapted to retain proper labeling and colors.
library(MASS)
ld1 <- lda(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12c)
ld1
Call:
lda(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD, 
    data = r12c)

Prior probabilities of groups:
  Lower Paleolithic Levallois technique  Middle Paleolithic        Homo Sapiens 
         0.15492958          0.32394366          0.47887324          0.04225352 

Group means:
                         LBI      RTI      WDI      FLA       PSF       FSF
Lower Paleolithic   1.105455 33.82727 2.690000 118.8182 31.881818  8.372727
Levallois technique 1.254348 28.74348 2.768261 121.2609 15.021739 14.900000
Middle Paleolithic  1.195000 25.88235 3.384412 113.9706  9.497059 23.294118
Homo Sapiens        1.453333 22.96333 3.150000 117.0000  7.666667 27.900000
                        ZDF1    PROZD
Lower Paleolithic   15.86364 57.81818
Levallois technique 30.46087 69.17391
Middle Paleolithic  62.31471 84.23529
Homo Sapiens        56.36667 87.00000

Coefficients of linear discriminants:
              LD1           LD2         LD3
LBI    2.63864596 -7.3053369000  4.96921267
RTI    0.03735899  0.0683327644  0.03955854
WDI    0.64616126 -0.3086473881  0.06507385
FLA   -0.04798439 -0.0591106950 -0.05229559
PSF   -0.01917272  0.0515112771  0.10178299
FSF    0.02488002 -0.0001579818  0.07080551
ZDF1   0.02896112  0.0396366883 -0.01625858
PROZD  0.03036974 -0.0079845555  0.04335520

Proportion of trace:
   LD1    LD2    LD3 
0.7026 0.2585 0.0389 
There are eight objects incorrectly classified. Out of 71 complete cases, that is roughly 11% wrong.
mydf <- data.frame(ID=r12c$ID,
    Group=r12c$Group,
    predict=predict(ld1)$class)
mydf[mydf$Group!=mydf$predict,]

    ID               Group             predict
9    c  Middle Paleolithic Levallois technique
10  cl   Lower Paleolithic Levallois technique
37  ms Levallois technique   Lower Paleolithic
38   n  Middle Paleolithic Levallois technique
45 roe  Middle Paleolithic Levallois technique
55  sz Levallois technique  Middle Paleolithic
61  va Levallois technique        Homo Sapiens
68 woe   Lower Paleolithic Levallois technique
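The same information is more compact as a confusion matrix; a one-line sketch:
table(Observed=r12c$Group, Predicted=predict(ld1)$class)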

LDA Plot

This is a small adaptation of the plot.lda function in MASS. On visual examination, it seems that the groups are not completely separated. In particular, I wonder if they are much better separated than in last week's biplot.
cols <- colorRampPalette(c('violet','gold','seagreen'))(4) 
myldaplot <- function (x, panel = panel.lda, ..., cex = 0.7, dimen, abbrev = FALSE, 
    xlab = "LD1", ylab = "LD2",indata,cols) 
{
  panel.lda <- function(x, y, ...) text(x, y, g, 
        cex = cex, ...)
  if (!is.null(Terms <- x$terms)) {
    data <- model.frame(x)
    X <- model.matrix(delete.response(Terms), data)
    g <- indata$ID
    mr <- cols[as.numeric(model.response(data))]
    xint <- match("(Intercept)", colnames(X), nomatch = 0L)
    if (xint > 0L) 
      X <- X[, -xint, drop = FALSE]
  }
  else {
    xname <- x$call$x
    gname <- x$call[[3L]]
    if (!is.null(sub <- x$call$subset)) {
      X <- eval.parent(parse(text = paste(deparse(xname, 
                      backtick = TRUE), "[", deparse(sub, backtick = TRUE), 
                  ",]")))
      g <- eval.parent(parse(text = paste(deparse(gname, 
                      backtick = TRUE), "[", deparse(sub, backtick = TRUE), 
                  "]")))
    }
    else {
      X <- eval.parent(xname)
      g <- eval.parent(gname)
    }
    if (!is.null(nas <- x$call$na.action)) {
      df <- data.frame(g = g, X = X)
      df <- eval(call(nas, df))
      g <- df$g
      X <- df$X
    }
  }
  if (abbrev) 
    levels(g) <- abbreviate(levels(g), abbrev)
  means <- colMeans(x$means)
  X <- scale(X, center = means, scale = FALSE) %*% x$scaling
  if (!missing(dimen) && dimen < ncol(X)) 
    X <- X[, 1L:dimen, drop = FALSE]
  if (ncol(X) > 2L) {
    pairs(X, panel = panel, ...,col=mr)
  }
  else if (ncol(X) == 2L) {
    eqscplot(X[, 1L:2L], xlab = xlab, ylab = ylab, type = "n", 
        ...)
    panel(X[, 1L], X[, 2L], ...)
  }
  else ldahist(X[, 1L], g, xlab = xlab, ...)
  invisible(NULL)
}
myldaplot(ld1,indata=r12c,cols=cols)

Classification tree

Rpart is my favorite classification tree implementation. The only tuning is setting minsplit to 10; the default of 20 seems a bit large for 79 objects and four categories. The printed output is skipped, since we have the plot. The interpretation is pretty simple. First split on ZDF1, to distinguish old from young (less old?). The young can be split on LBI into Middle Paleolithic and Homo Sapiens, the old by PSF into Levallois and Lower Paleolithic. A final split between Middle and Lower Paleolithic shows that the differences are not clear cut.
library(rpart)
rp1 <- rpart(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12,minsplit=10)
plot(rp1,margin=.1)
text(rp1, use.n = TRUE,cex=.8)
There are ten incorrectly predicted objects. Note that it is difficult to compare this with LDA, since objects with missing data (e.g. sk misses FLA, PSF, FSF) are predicted with rpart, while they were removed prior to the LDA analysis.

mydf <- data.frame(ID=r12$ID,
    Group=r12$Group,
    predict=predict(rp1,type='class'))
mydf[mydf$Group!=mydf$predict,]
    ID               Group             predict
6  bie Levallois technique  Middle Paleolithic
10   c  Middle Paleolithic Levallois technique
11  cl   Lower Paleolithic Levallois technique
42  ms Levallois technique   Lower Paleolithic
43   n  Middle Paleolithic Levallois technique
59  sk Levallois technique  Middle Paleolithic
62  sz Levallois technique  Middle Paleolithic
65  ta Levallois technique  Middle Paleolithic
76 woe   Lower Paleolithic Levallois technique
77 wol Levallois technique  Middle Paleolithic
Note that cross-validation makes for a much worse model assessment; about one third are incorrectly predicted with leave-one-out.
xmat <- xpred.rpart(rp1,xval=79)
colSums(xmat!=do.call(cbind,lapply(1:5,function(x) as.numeric(r12$Group) )))
0.71428571 0.30304576 0.12371791 0.05832118 0.02182179 
        42         26         23         24         25 

Random forest

Random forest seems to predict slightly worse than the simpler methods, with an OOB error around 24% in the run shown. As might be expected, Homo Sapiens, with only 3 rows, is particularly difficult to classify. Similar to the classification tree, ZDF1 is an important variable; FLA and PROZD, which were not important in the classification tree, are important in random forest.

library(randomForest)

rf1 <- randomForest(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
   data=r12c,
   importance=TRUE,
   nodesize=10)
rf1
Call:
 randomForest(formula = Group ~ LBI + RTI + WDI + FLA + PSF +      FSF + ZDF1 + PROZD, data = r12c, importance = TRUE, nodesize = 10)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 23.94%
Confusion matrix:
                    Lower Paleolithic Levallois technique Middle Paleolithic
Lower Paleolithic                   8                   3                  0
Levallois technique                 3                  17                  3
Middle Paleolithic                  1                   4                 29
Homo Sapiens                        0                   0                  3
                    Homo Sapiens class.error
Lower Paleolithic              0   0.2727273
Levallois technique            0   0.2608696
Middle Paleolithic             0   0.1470588
Homo Sapiens                   0   1.0000000
Wrongly predicted:
mydf <- data.frame(ID=r12c$ID,
    Group=r12c$Group,
    predict=predict(rf1))
mydf[mydf$Group!=mydf$predict,]
    ID               Group             predict
6  bie Levallois technique  Middle Paleolithic
8   bo Levallois technique  Middle Paleolithic
9   by Levallois technique   Lower Paleolithic
10   c  Middle Paleolithic Levallois technique
11  cl   Lower Paleolithic Levallois technique
14  e2  Middle Paleolithic   Lower Paleolithic
21  g5  Middle Paleolithic Levallois technique
28 gue Levallois technique   Lower Paleolithic
40  ml   Lower Paleolithic Levallois technique
42  ms Levallois technique   Lower Paleolithic
43   n  Middle Paleolithic Levallois technique
50 roe  Middle Paleolithic Levallois technique
55 sa1        Homo Sapiens  Middle Paleolithic
56 sa2        Homo Sapiens  Middle Paleolithic
57 sa3        Homo Sapiens  Middle Paleolithic
62  sz Levallois technique  Middle Paleolithic


Importance plot


varImpPlot(rf1)

Friday, June 6, 2014

stone flakes

I browsed through the UC Irvine Machine Learning Repository the other day and noticed a nice data set regarding stone flakes produced by our ancestors, the prehistoric men. To quote the dataset owners:
'The data set concerns the earliest history of mankind. Prehistoric men created the desired shape of a stone tool by striking on a raw stone, thus splitting off flakes, the waste products of the crafting process. Archaeologists do not find many tools, but they do find flakes. The data set is about these flakes.' The question attached to the data is: 'Does the data reflect the technological progress during several hundred thousand years?'. This blog post does not tackle that question but first examines the data as a multivariate set.

Data

The data consist of two data sets: one for flake properties, where rows do not stand for single flakes but for whole inventories of them, and annotation data with associated properties, such as age and hominid type. There are 79 records. As can be seen from the first records, there are some missing data.
r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')
head(r2)
   ID  LBI  RTI  WDI FLA  PSF  FSF ZDF1 PROZD
1  ar   NA 35.3 2.60  NA 42.4 24.2 47.1    69
2 arn 1.23 27.0 3.59 122  0.0 40.0 40.0    30
3  be 1.24 26.5 2.90 121 16.0 20.7 29.7    72
4 bi1 1.07 29.1 3.10 114 44.0  2.6 26.3    68
5 bi2 1.08 43.7 2.40 105 32.6  5.8 10.7    42
6 bie 1.39 29.5 2.78 126 14.0  0.0 50.0    78
r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')
head(r1)
   ID group  age dating mat region site number
1  ar     3 -120    geo   2      d    0     34
2 arn     2 -200   typo   1    mit    1      5
3  be     2 -200   typo   1    mit    1    331
4 bi1     1 -300    geo   1    mit    0   4111
5 bi2     1 -300    geo   2    mit    0     77
6 bie     2 -200    geo   1    mit    1      8

Density plots

To get an idea about the data I have made density plots. For compactness lattice plots are used; the reshape is just a preparation for that. From a data perspective, the homo sapiens group is pretty small and has a small data range.

r12 <- merge(r1,r2)
# density
cols <- colorRampPalette(c('violet','gold','seagreen'))(4) 
library(lattice)
long <- reshape(r12,direction='long',
    v.names='Response',
    varying=list(colnames(r2)[-1]),
    idvar=c('ID','group'),
    timevar='Variable',
    times=colnames(r2)[-1])
densityplot(~ Response | Variable ,groups= group,
    data=long ,scales=list(relation='free'),
    col=cols,
    auto.key=list(
        text=c('Lower Paleolithic, Homo ergaster?, oldest',
            'Levallois technique',
            'Middle Paleolithic, probably Neanderthals',
            'Homo Sapiens, youngest'),
        col=cols,
        lines=FALSE))

Biplot

A biplot is easily made. However, I am a bit of a fan of the biplots detailed in Gower and Hand's book. Since the heavy lifting for those is now in package calibrate, they are easily made too.
r12c <- r12[complete.cases(r12),]
pr1 <- princomp(~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    r12c,
    cor=TRUE,
    scores=TRUE)
biplot(pr1,xlabs=r12c$ID)
Most of the following code is from calibrate's vignette. The colors in the point labels are an annotation which I made. Unfortunately the textxy() function did not handle color as I intended, so a for loop is used to get it correct. The lengths of the blue axes were set via trial and error. It should be noted that, as in any biplot, there is some deformation; the axes are approximate.
library(calibrate)
X0<- subset(r12c,,c(LBI,RTI,WDI,FLA,PSF,FSF,ZDF1 ,PROZD))
X <- scale(X0)
rownames(X) <- r12c$ID
pca.results <- princomp(X,cor=FALSE)
Fp <- pca.results$scores
Gs <- pca.results$loadings
# no margins
par(mar=rep(0.05,4))
plot(Fp[,1],Fp[,2],
    pch=16,asp=1,
    xlim=c(-5,5),ylim=c(-5,5),
    frame.plot=TRUE,axes=FALSE,
    cex=0.5,type='n',
    col=cols[r12c$group])

for( ii in unique(r12c$group))
  textxy(Fp[r12c$group==ii,1],
      Fp[r12c$group==ii,2],
      rownames(X)[r12c$group==ii],
      cex=0.75,
      col=cols[ii],offset=0)

for (ii in 1:ncol(X)) {
  myseq <- seq(-2,2)
  if (colnames(X)[ii]=='LBI') myseq <-seq(-2,3)
  if (colnames(X)[ii] %in% c('RTI','FSF','PROZD')) myseq <-seq(-1.4,1.4)
  if (colnames(X)[ii]=='ZDF1') myseq <-seq(-1.5,2)
  ticklab <- pretty(myseq*attr(X,'scaled:scale')[ii]+attr(X,'scaled:center')[ii])
  
  ticklabc <- (ticklab-attr(X,'scaled:center')[ii])/attr(X,'scaled:scale')[ii]
  yc <- X[,ii]
  g <- Gs[ii,1:2]
  Calibrate.X3 <- calibrate(g,yc,ticklabc,Fp[,1:2],ticklab,tl=0.1,
      axislab=colnames(X)[ii],cex.axislab=0.75,where=1,labpos=4)
}

legend(x='topleft',
    legend=c('Lower Paleolithic, Homo ergaster?, oldest',
        'Levallois technique',
        'Middle Paleolithic, probably Neanderthals',
        'Homo Sapiens, youngest'),
    text.col=cols,
    ncol=1,cex=.75)

Hierarchical clustering

For the clustering it was chosen to use scaled data, just like the biplot. The reason is that the scales of the variables are quite different. The distance used is plain Euclidean, with average linkage. The code for colors in the dendrogram is not standard, but adapted from Stack Overflow.
ddi <- dist(X)
par(cex=.7)
hc <- hclust(ddi,method='average')

# adapted from http://stackoverflow.com/questions/18802519/label-and-color-leaf-dendrogram-in-r
labelCol <- function(x) {
  if (is.leaf(x)) {
    ## fetch label
    label <- attr(x, "label") 
    ## set label color according to the leaf's hominid group
    attr(x, "nodePar") <- list(lab.col=cols[r12$group[r12$ID==label]],
        pch=46)
  }
  return(x)
}
## apply labelCol on all nodes of the dendrogram
dd <- dendrapply(as.dendrogram(hc,hang=.1), labelCol)

par(mar=c(3,.1,.1,2))
plot(dd,horiz=TRUE)

legend(x='topleft',
    legend=c('Lower Paleolithic, Homo ergaster?, oldest',
        'Levallois technique',
        'Middle Paleolithic, probably Neanderthals',
        'Homo Sapiens, youngest'),
    text.col=cols,
    ncol=1,cex=.75)
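A more formal check is to cut the dendrogram into four clusters and cross-tabulate against the known groups; a minimal sketch:
# compare cluster membership with hominid group
table(cluster=cutree(hc,k=4), group=r12c$group)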

Interpretation

It would seem the data show that the flake shapes give a reasonable display of the groups, without using these groups as input information. This suggests that there is indeed a relation between flake shape and time, which is for a future blog post.

Sunday, June 1, 2014

Autocorrelation in project Tycho's measles data

Project Tycho includes data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested. I have looked at Project Tycho's measles data before: a general look, incidence, some high incidence data and correlation between states. After a detour, it is now time to look at the autocorrelations in these data. These show a positive correlation at three years.

Data

For the first steps, see the post on correlation between states.

Preparation for autocorrelation

As detailed before, the data contain weekly counts. Summer has less incidence, winter more. For this reason the calendar year is abandoned and a shifted year (named cycle) is used, which runs from summer to summer. The data, real world as they are, contain plenty of missings. Arbitrarily chosen: if a cycle has data from more than 40 weeks, then that cycle is used.
r7 <- aggregate(
    r6$incidence,
    list(cycle=r6$cycle,
        State=r6$State),
    function(x)
        if(length(x)>40)
            sum(x) else
            NA)

To calculate an autocorrelation, a set of consecutive years is needed. Again arbitrarily chosen, 15 years is the minimum. As a first attempt I just kicked out all missing years. Since that resulted in sufficient states with data, no attempts to refine were made. As an additional item the number of data points is stored. All in a nice list.
la <- lapply(levels(r7$State),function(x) {
        datain <- r7[r7$State==x,]
        datain <- datain[complete.cases(datain),]
        if (nrow(datain)==(1+max(datain$cycle)-min(datain$cycle)) &
            nrow(datain)>15)
            aa <- acf(datain$x,plot=FALSE,lag.max=6) else aa <- TRUE
        list(aa=aa,nr=nrow(datain))
    })
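As a spot check, the standard acf plot can be drawn for a single state; a sketch, where the state name is my assumption about how the data are labeled:
# autocorrelation for one state; 'NEW YORK' is a hypothetical level
one <- r7[r7$State=='NEW YORK',]
one <- one[complete.cases(one),]
acf(one$x, lag.max=6)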


To make a plot, the autocorrelations are pulled out and it is all stuck in a dataframe.
la2 <- la[which(sapply(la,function(x) class(x$aa))=='acf')]
scfs <- as.data.frame(t(sapply(la2,function(x) as.numeric(x$aa$acf))))
scfs$state <- levels(r7$State)[sapply(la,function(x) class(x$aa)=='acf')]
scfs$n <- sapply(la2,function(x) x$nr)

And a reshape prior to plotting.
tc <- reshape(scfs,
    idvar=c('state','n'),
    varying=list(names(scfs[1:7])),
    timevar='lag',
    times=0:6,
    direction='long',
    v.names='acf')

Plot

The plot shows a somewhat negative correlation after one year and a positive correlation after three years. The only state not to show the positive correlation after three years had much less data, hence I ignore that result for the conclusion.
library(ggplot2)
ggplot(tc,
        aes(y=acf, x=lag,group=state,col=state) )+
    geom_line(aes(size = n)) +
    scale_size(range = c(0.5, 3))+
    theme(text=element_text(family='Arial'))

References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.