Saturday, July 26, 2014

Guns are Cool - States

Last week I looked at time effects in the shootingtracker database. This week I will look at the states. Some (smaller) states never made it into the database; other states appear far too frequently, the worst of these being California. After correcting for population size, Washington DC sticks out with 0.8 shootings per year per 100,000 inhabitants. Subsequently, the observed incidences have been shrunk towards the common distribution. In these posterior estimates Massachusetts does best, while Washington DC, Illinois and Louisiana do worst.

Data

Data are from the shootingtracker database and were updated July 25; the last entry in the database is also from July 25. The script is a slightly adapted version of last week's. Most notably, I did not notice last week that the database sometimes uses KA for Kansas, while KS is what my other data uses. Note that in order to keep this post readable, all code is in a separate section.

Adding state information

The first thing is to get one record per state. The datasets package has state information, while the number of inhabitants (2013 data) is plucked from census.gov. After merging, some states have missing counts for shootings; absence from the shootingtracker database is taken as 0 shootings.
Incidence is calculated per 100 000 inhabitants per year (365 days).
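As a worked example of that formula (the numbers here are hypothetical, not taken from the database): a state with 8 shootings and 650 000 inhabitants over an observation window of 570 days gets an incidence of roughly 0.79.
# worked toy example of the incidence calculation; numbers are hypothetical
shootings  <- 8
population <- 650000
days       <- 570                              # length of the observation window
100000 * shootings / population * 365 / days   # about 0.79 per 100 000 per year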

Calculating posteriors

Assuming a beta prior and a binomial distribution for the number of shootings, Jags is used to estimate the posteriors. The one noticeable thing in the model is the prior for b: b really wants to be high, hence the prior is quite a bit wider than usual. The current values seem acceptable; see also the output at the bottom of this post.
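As a quick check of what that prior implies (my own sketch, not part of the original script), one can sample from the truncated normal for b and look at the implied prior mean a/(a+b) for an illustrative value of a:
# sketch: draw from the N(1e7, precision 1e-14) prior for b, truncated at zero,
# and inspect the implied beta mean a/(a+b) for a fixed, illustrative a
set.seed(1)
b <- rnorm(10000, mean = 1e7, sd = 1e7)   # precision 1e-14 corresponds to sd 1e7
b <- b[b > 0]                             # truncation at zero
a <- 8                                    # illustrative value, near the posterior mean of a
summary(a / (a + b))                      # implied beta means before scaling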
The plot shows a number of things. In red are the observed values; in black the posterior values: means and 90% intervals of the samples. The blue line is the value towards which the data are shrunk. Washington DC is shrunk quite a bit, because it has relatively few people and a large value. The same is true for Delaware. Illinois, on the other hand, has many inhabitants, hence is shrunk much less; it also has a narrower interval. On the other side, states like Wyoming are shrunk upwards. Again, more inhabitants means less shrinking. Massachusetts ends up as having the lowest number of shootings. Really big states, New York, Texas and California, are not shrunk at all and have quite narrow intervals: Texas and NY on the good side, California on the bad side.
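The amount of shrinking follows from beta-binomial conjugacy: conditional on a and b, the posterior mean of a state's rate is (a + count)/(a + b + population), so the population acts as the weight of the data. A small illustration with hypothetical counts, plugging in roughly the posterior means of a and b reported at the bottom of this post:
# shrinkage illustration; counts are hypothetical, a and b are roughly the
# posterior means from the Jags output, the window is about 570 days
a <- 9.7
b <- 6.2e6
scale <- 100000*365/570
shrunk <- function(count, population) scale*(a+count)/(a+b+population)
shrunk(8, 6.5e5)    # small state with a high observed rate: pulled strongly towards the mean
shrunk(60, 1.3e7)   # large state: shrunk much less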



The map shows the posteriors in a different way. At this point it might be worth noting that the north east is actually doing quite well, but since these states have so few inhabitants they are shrunk upwards and look bad. The current approach does not provide a natural way to avoid this. In a year or two additional data may cause at least some change in the posteriors of the smaller states, both at the top and the bottom of this list.

Code

Reading data

r13 <- readLines('raw13.txt')
r14 <- readLines('raw14.txt')
r1 <- c(r13,r14)
r2 <- gsub('\\[[a-zA-Z0-9]*\\]','',r1)
r3 <- gsub('^ *$','',r2)
r4 <- r3[r3!='']
r5 <- gsub('\\t$','',r4)
r6 <- gsub('\\t References$','',r5)
r7 <- read.table(textConnection(r6),
    sep='\t',
    header=TRUE,
    stringsAsFactors=FALSE)
r7$Location[r7$Location=='Washington DC'] <-
    'WashingtonDC, DC'
r8 <- read.table(textConnection(as.character(r7$Location)),
    sep=',',
    col.names=c('Location','State'),
    stringsAsFactors=FALSE)
r8$State <- gsub(' ','',r8$State)
r8$State[r8$State=='Tennessee'] <- 'TN'
r8$State[r8$State=='Ohio'] <- 'OH'
r8$State[r8$State %in% c('Kansas','KA')] <- 'KS'
r8$State[r8$State=='Louisiana'] <- 'LA'
r8$State[r8$State=='Illinois'] <- 'IL'
r8$State <- toupper(r8$State)
r7$StateAbb <- r8$State
r7$Location <- r8$Location
r7 <- r7[!(r7$StateAbb %in% c('PUERTORICO','PR')),]
r7$Date <- gsub('/13$','/2013',r7$Date)
r7$date <- as.Date(r7$Date,format="%m/%d/%Y")

State information

r7$one <- 1
agg1 <- with(r7,
    aggregate(one,by=list(StateAbb=StateAbb),FUN=sum))
library(datasets)
states <- data.frame(StateAbb=as.character(state.abb),
    StateArea=state.area,
    State=as.character(state.name),
    StateRegion=state.region,
    Frost=state.x77[,'Frost']
)
# http://www.census.gov/popest/data/state/totals/2013/index.html
# data pre processed so the csv can be used as is.
inhabitants <- read.csv('NST-EST2013-01.treated.csv')
#put it all together
states <- rbind(states,data.frame(StateAbb='DC',
        StateArea=68.3,
        State='District of Columbia',
        StateRegion='South',
        Frost=NA))
r9 <- merge(states,inhabitants)
r10 <- merge(r9,agg1,all=TRUE)
# No data in some states
r10$x[is.na(r10$x)] <- 0
r10$incidence <- 100000*r10$x/r10$Population*365/
    as.numeric((max(r7$date)-min(r7$date)))

Jags model


library(R2jags)
datain <- list(
    count=r10$x,
    Population = r10$Population,
    n=nrow(r10),
    scale=100000*365/
        as.numeric((max(r7$date)-min(r7$date))))

model1 <- function() {
  for (i in 1:n) {
    count[i] ~ dbin(p[i],Population[i])
    p[i] ~ dbeta(a,b)
    pp[i] <- p[i]*scale
  }
  # %_% is stripped when the model function is written to file, so JAGS sees T(0,)
  a ~ dnorm(0,.000001) %_% T(0,)
  b ~ dnorm(1e7,.00000000000001) %_% T(0,)
  betamean <- scale * a/(a+b)
}

parameters <- c('a','b','betamean','pp')
jagsfit <- jags(datain, 
    model=model1, 
    parameters=parameters,
    n.iter=20000,
    n.burnin=1000,
    n.chain=4,
    progress.bar="gui")
jagsfit

Plot 1

samples <- jagsfit$BUGSoutput$sims.matrix[,5:55]
r10$pred <- apply(samples,2,mean)
r10$l05 <- apply(samples,2,function(x) quantile(x,.05))
r10$l95 <- apply(samples,2,function(x) quantile(x,.95))
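# Note (my addition): the column range 5:55 depends on the order in which the
# monitored parameters are stored; selecting the pp[] columns by name would be
# a more robust alternative:
# ppcols  <- grep('^pp\\[', colnames(jagsfit$BUGSoutput$sims.matrix))
# samples <- jagsfit$BUGSoutput$sims.matrix[, ppcols]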
library(ggplot2)
r11 <- subset(r10,,c(State,l05,l95,pred,incidence))
r11 <- r11[order(r11$pred),]
r11$State <- factor(r11$State,levels=r11$State)
p <- ggplot(r11, aes(y=pred, x=State))
limits <- aes(ymin =l05, ymax=l95)
dodge <- position_dodge(width=0.9)
p + geom_point(position=dodge, stat="identity") +
    geom_errorbar(limits, position=dodge, width=0.25) +
    theme(legend.position = "bottom") +
    geom_point(aes(y=incidence,x=State),
        position=dodge, stat="identity",col='red') +
    ylab('Incidence') +
    geom_hline(yintercept=mean(jagsfit$BUGSoutput$sims.matrix[,3]),col='blue') +
    coord_flip()

Plot 2

# http://stackoverflow.com/questions/20434343/color-us-states-according-to-value-in-r
# your values (here as named vector)

library(maps)
values <- (r10$pred-min(r10$pred))/(max(r10$pred)-min(r10$pred))
#values <- (r10$incidence-min(r10$incidence))/
#    (max(r10$incidence[r10$incidence<.5])-min(r10$incidence))

names(values) <- r10$StateAbb
# getting the names used by map
tmp <- map('state',plot=FALSE,namesonly=TRUE)
# matching (after adjusting using gsub and tolower)
tmp <- match(gsub('(:.*)','',tmp),tolower(r10$State))
map('state',
    fill=TRUE,
    col=rgb(colorRamp(c('green','red'))(values),max=255)[tmp])
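A small sanity check I would add here (my addition, not in the original code): every region name returned by map('state') should find a match in r10$State, otherwise a polygon is left without a colour.
# any NA means a map region name did not match a state name in r10
any(is.na(tmp))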

Jags output

Inference for Bugs model at "C:/Users/Kees/AppData/Local/Temp/RtmpwbAFvR/modela3419a366f0.txt", fit using jags,
 4 chains, each with 20000 iterations (first 1000 discarded), n.thin = 19
 n.sims = 4000 iterations saved
             mu.vect     sd.vect        2.5%         25%         50%
a              9.732       6.530       2.931       5.662       7.924
b        6207140.792 4125444.443 1898003.649 3609272.959 5094169.482
betamean       0.101       0.008       0.087       0.096       0.101
pp[1]          0.087       0.032       0.032       0.064       0.085
pp[2]          0.123       0.030       0.073       0.103       0.120
pp[3]          0.071       0.026       0.026       0.053       0.070
pp[4]          0.099       0.024       0.058       0.083       0.098
pp[5]          0.130       0.014       0.105       0.120       0.130
pp[6]          0.081       0.023       0.041       0.065       0.079
pp[7]          0.097       0.027       0.050       0.078       0.094
pp[8]          0.130       0.041       0.066       0.101       0.123
pp[9]          0.110       0.017       0.079       0.097       0.109
pp[10]         0.094       0.020       0.059       0.081       0.093
pp[11]         0.079       0.030       0.027       0.058       0.077
pp[12]         0.071       0.026       0.026       0.052       0.069
pp[13]         0.076       0.029       0.024       0.055       0.074
pp[14]         0.177       0.029       0.127       0.158       0.176
pp[15]         0.122       0.027       0.077       0.103       0.120
pp[16]         0.122       0.033       0.069       0.099       0.118
pp[17]         0.109       0.028       0.062       0.089       0.106
pp[18]         0.159       0.037       0.100       0.132       0.154
pp[19]         0.056       0.020       0.022       0.041       0.054
pp[20]         0.087       0.023       0.048       0.071       0.086
pp[21]         0.090       0.031       0.035       0.068       0.087
pp[22]         0.124       0.024       0.083       0.107       0.122
pp[23]         0.080       0.023       0.040       0.063       0.078
pp[24]         0.127       0.028       0.080       0.108       0.124
pp[25]         0.079       0.026       0.032       0.061       0.078
pp[26]         0.083       0.031       0.029       0.062       0.081
pp[27]         0.121       0.023       0.080       0.105       0.119
pp[28]         0.087       0.032       0.029       0.065       0.086
pp[29]         0.092       0.030       0.041       0.072       0.090
pp[30]         0.080       0.031       0.027       0.057       0.078
pp[31]         0.106       0.022       0.068       0.090       0.104
pp[32]         0.108       0.032       0.055       0.086       0.104
pp[33]         0.139       0.037       0.081       0.113       0.135
pp[34]         0.076       0.014       0.051       0.066       0.075
pp[35]         0.097       0.019       0.063       0.084       0.096
pp[36]         0.123       0.031       0.074       0.101       0.120
pp[37]         0.071       0.024       0.029       0.054       0.070
pp[38]         0.094       0.018       0.062       0.081       0.093
pp[39]         0.106       0.035       0.049       0.082       0.102
pp[40]         0.110       0.027       0.063       0.091       0.108
pp[41]         0.086       0.032       0.030       0.064       0.084
pp[42]         0.122       0.026       0.078       0.103       0.119
pp[43]         0.078       0.013       0.055       0.069       0.077
pp[44]         0.081       0.027       0.034       0.062       0.080
pp[45]         0.111       0.023       0.070       0.094       0.109
pp[46]         0.089       0.034       0.033       0.066       0.086
pp[47]         0.081       0.021       0.043       0.066       0.080
pp[48]         0.065       0.021       0.029       0.050       0.064
pp[49]         0.103       0.032       0.051       0.080       0.100
pp[50]         0.090       0.034       0.032       0.066       0.087
pp[51]         0.185       0.063       0.095       0.139       0.174
deviance     242.511      13.874     217.150     232.666     241.477
                 75%        97.5%  Rhat n.eff
a             11.529       28.847 1.068    63
b        7337119.577 18410168.237 1.067    64
betamean       0.106        0.117 1.001  4000
pp[1]          0.106        0.158 1.003   910
pp[2]          0.140        0.190 1.001  2600
pp[3]          0.088        0.125 1.006   460
pp[4]          0.114        0.152 1.001  4000
pp[5]          0.139        0.159 1.002  1800
pp[6]          0.096        0.130 1.005   560
pp[7]          0.112        0.157 1.001  4000
pp[8]          0.151        0.230 1.001  3000
pp[9]          0.120        0.147 1.001  2800
pp[10]         0.107        0.137 1.003  1100
pp[11]         0.096        0.143 1.007   450
pp[12]         0.087        0.126 1.005   550
pp[13]         0.094        0.137 1.007   420
pp[14]         0.196        0.238 1.012   250
pp[15]         0.138        0.182 1.001  4000
pp[16]         0.140        0.200 1.001  2900
pp[17]         0.124        0.171 1.001  4000
pp[18]         0.182        0.242 1.008   350
pp[19]         0.069        0.098 1.008   320
pp[20]         0.102        0.137 1.002  2000
pp[21]         0.108        0.158 1.003  1200
pp[22]         0.139        0.178 1.003   970
pp[23]         0.094        0.128 1.002  1800
pp[24]         0.144        0.190 1.004   660
pp[25]         0.096        0.135 1.006   550
pp[26]         0.102        0.151 1.003  3700
pp[27]         0.135        0.170 1.002  1800
pp[28]         0.106        0.159 1.004  1200
pp[29]         0.110        0.160 1.004  1200
pp[30]         0.099        0.145 1.005   520
pp[31]         0.120        0.154 1.002  1400
pp[32]         0.127        0.179 1.003  1800
pp[33]         0.161        0.224 1.006   450
pp[34]         0.085        0.105 1.006   480
pp[35]         0.108        0.138 1.001  3600
pp[36]         0.141        0.191 1.002  1300
pp[37]         0.087        0.123 1.004   700
pp[38]         0.106        0.133 1.001  4000
pp[39]         0.125        0.184 1.002  4000
pp[40]         0.126        0.171 1.002  4000
pp[41]         0.106        0.154 1.003  1200
pp[42]         0.138        0.180 1.006   470
pp[43]         0.086        0.106 1.006   480
pp[44]         0.098        0.143 1.005   590
pp[45]         0.125        0.161 1.002  2500
pp[46]         0.109        0.164 1.004   700
pp[47]         0.094        0.126 1.004   700
pp[48]         0.080        0.110 1.008   320
pp[49]         0.120        0.173 1.004   790
pp[50]         0.109        0.166 1.003  1400
pp[51]         0.217        0.337 1.012   210
deviance     251.840      271.487 1.039    88

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 92.8 and DIC = 335.3
DIC is an estimate of expected predictive error (lower deviance is better).

Saturday, July 19, 2014

Guns are cool - time effects

In September last year I made a post using the shootingtracker data. Shootingtracker attempts to register all shootings with at least four victims, be they wounded or killed. The data start on January 1st, 2013, which means that by now the amount of data has almost doubled. This surely is a dataset where I hope the makers find less and less data to add. Analysis shows Sundays in summer have the highest number of shootings per day: three to four shootings on a Sunday in July and August.

Data

Data sit in two pages on shootingtracker, 2013 and 2014. In preparation for this post I copied/pasted those data into notepad and removed headers and footers. In the 2013 data I kept the names of the columns. The first steps of reading the data remove some of the (for me) extraneous info, such as the reference where the data came from. Subsequently the state and town are separated, and a few records which do not have the correct state abbreviation are corrected. Finally, 13 is reformatted to 2013 and a date variable is created. The last record used is from July 9th, 2014.
r13 <- readLines('raw13.txt')
r14 <- readLines('raw14.txt')
r1 <- c(r13,r14)
head(r1)
[1] "Number\t Date\t Alleged Shooter\t Killed\t Wounded\t Location\t References"
[2] "1\t1/1/13\tCarlito Montoya\t4\t0\tSacramento, CA\t"                        
[3] " [Expand] "                                                          
[4] "2\t1/1/13\tUnknown\t1\t3\tHawthorne, CA\t"                                 
[5] " [Expand] "                                                          
[6] "3\t1/1/13\tJulian Sims\t0\t4\tMcKeesport, PA\t"
tail(r1)
[1] "141\t7/8/2014\tUnknown\t1\t4\tSan Bernardino, CA\t"    
[2] " [Expand] "                                      
[3] "142\t7/8/2014\tUnknown\t0\t5\tProvidence, RI\t"        
[4] " [Expand] "                                      
[5] "143\t7/9/2014\tRonald Lee Haskell\t6\t1\tHouston, TX\t"
[6] " [Expand] " 
r2 <- gsub('\\[[a-zA-Z0-9]*\\]','',r1)
r3 <- gsub('^ *$','',r2)
r4 <- r3[r3!='']
r5 <- gsub('\\t$','',r4)
r6 <- gsub('\\t References$','',r5)
r7 <- read.table(textConnection(r6),
    sep='\t',
    header=TRUE,
    stringsAsFactors=FALSE)
r7$Location[r7$Location=='Washington DC'] <-
    'Washington, DC'
r8 <- read.table(textConnection(as.character(r7$Location)),
    sep=',',
    col.names=c('Location','State'),
    stringsAsFactors=FALSE)
r8$State <- gsub(' ','',r8$State)
r8$State[r8$State=='Tennessee'] <- 'TN'
r8$State[r8$State=='Ohio'] <- 'OH'
r8$State[r8$State=='Kansas'] <- 'KS'
r8$State[r8$State=='Louisiana'] <- 'LA'
r8$State[r8$State=='Illinois'] <- 'IL'
r8$State <- toupper(r8$State)

r7$State <- r8$State
r7$Location <- r8$Location
r7 <- r7[r7$State != 'PUERTORICO',]
Sys.setlocale(category = "LC_TIME", locale = "C")
r7$Date <- gsub('/13$','/2013',r7$Date)
r7$date <- as.Date(r7$Date,format="%m/%d/%Y")

Effect of day and month

The effect of day of the week is pretty easy to plot: just add the day and run qplot(weekday). In this case complexities arise because I want the days in a specific order, Monday to Sunday. A second issue is that not all days occur equally often in the test period. This is not enough to invalidate the plot, but since I had to correct for the occurrence of months for a similar plot, I decided to reuse that code for weekdays. The data frame alldays is used to calculate the number of days in the data set. I am not going to over-analyze this: Sundays stick out in a negative way.
library(ggplot2)
r7$weekday <- factor(format(r7$date,'%a'),
    levels=format(as.Date('07/07/2014',format="%m/%d/%Y")+0:6,'%a'))
r7$month <- factor(format(r7$date,'%b'),
    levels=format(as.Date('01/15/2014',format="%m/%d/%Y")+
            seq(0,length.out=12,by=30),'%b'))
alldays <- data.frame(
    date=seq(min(r7$date),max(r7$date),by=1))
alldays$weekday <- factor(format(alldays$date,'%a'),
    levels=levels(r7$weekday))
alldays$month <- factor(format(alldays$date,'%b'),
    levels=format(as.Date('01/15/2014',format="%m/%d/%Y")+
            seq(0,length.out=12,by=30),'%b'))
ggplot(data=data.frame(prop=as.numeric(table(r7$weekday)/
        table(alldays$weekday)),
    weekday=factor(levels(r7$weekday),
        levels=levels(r7$weekday))),
    aes(y=prop,x=weekday)) +
    geom_bar(stat='identity') +
    ylab('Shootings per day') +
    xlab('Day of the week')
In terms of months, it seems summer is worse than the other seasons and winter is best.
ggplot(data=data.frame(prop=as.numeric(table(r7$month)/
                    table(alldays$month)),
            month=factor(levels(r7$month),
                levels=levels(r7$month))),
        aes(y=prop,x=month)) +
    geom_bar(stat='identity') +
    ylab('Shootings per day') +
    xlab('Month')   

A model

Given the unequal distribution of weekdays over months, it is not obvious how significant a month effect is. To examine this, I have reorganized the data to give shootings per day. Data frame alldays is used again, now to ensure days with no shootings are correctly represented. The modeling shows a clear effect of month, with the day-by-month interaction on the brink of significance.
r7$one <- 1
ag <- aggregate(r7$one,
    by=list(date=r7$date),FUN=sum)
counts <- merge(alldays,ag,all=TRUE)
counts$x[is.na(counts$x)] <- 0
g0 <- glm(x ~ weekday ,data=counts,
    family='poisson')
g1 <- glm(x ~ weekday + month,data=counts,
    family='poisson')
g2 <- glm(x ~ weekday * month,data=counts,
    family='poisson')
anova(g0,g1,g2,test='Chisq')
Analysis of Deviance Table

Model 1: x ~ weekday
Model 2: x ~ weekday + month
Model 3: x ~ weekday * month
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
1       548     653.86                        
2       537     624.86 11   29.008 0.002264 **
3       471     538.99 66   85.870 0.050718 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
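The p-values in the table can be reproduced by hand from the printed deviance differences, which makes explicit that the month effect rests on a chi-square test with 11 degrees of freedom and the interaction on one with 66 degrees of freedom:
# reproduce the p-values from the deviance table above
pchisq(29.008, df = 11, lower.tail = FALSE)   # month effect, about 0.0023
pchisq(85.870, df = 66, lower.tail = FALSE)   # day by month interaction, about 0.051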

Predictions

To understand the interaction, a plot is made of the expected values by day and month. Looking at this plot, the Sunday effect is most pronounced in summer. In addition, I would not be surprised if next Sunday has four shootings again.
pred1 <- expand.grid(weekday=factor(levels(r7$weekday),
        levels=levels(r7$weekday)),
    month=factor(levels(r7$month),
        levels=levels(r7$month)))
preds <- predict(g2,pred1,
    type='response',
    se=TRUE)
pred1$fit <- preds$fit
pred1$se.fit <- preds$se.fit

limits <- aes(ymax = fit+se.fit, ymin=fit-se.fit)
p <- ggplot(pred1, aes(fill=weekday, y=fit, x=month))
dodge <- position_dodge(width=0.9)
p + geom_bar(position=dodge, stat="identity") +
    geom_errorbar(limits, position=dodge, width=0.25) +
    theme(legend.position = "bottom") +
    ylab('Shootings per Day') +
    guides (fill=guide_legend('Day'))


Follow the example?

One might ask whether shootings which get much media attention give rise to copycats. This is not easy to analyze, given that clear time effects of day and month exist. Besides, which shootings get a lot of media attention? Yet we can look at the number of shootings over time and at least mark the shootings with the most victims in the plot. The threshold for marking has arbitrarily been chosen as 18 or more victims. In this plot I cannot see a connection.
r7$Victims <- r7$Killed+r7$Wounded
table(r7$Victims)
  4   5   6   7   8   9  10  12  13  14  18  19  20  21 
322 104  25  26  11   5   4   2   2   2   1   1   1   1 

p <- ggplot(counts, aes(y=x,x=date))
p + stat_smooth(span=.11) +
    geom_point() +
    geom_vline(xintercept=as.numeric(r7$date[r7$Victims>15])) +
    ylab('shootings per day')

An ACF

Mostly because making an ACF is an integral part of analyzing time series: here it is, in case anybody doubted the week effect. I chose this fairly long lag because 10 weeks seemed nice to me.
plot(acf(counts$x,lag.max=70))

Saturday, July 12, 2014

odfweave setup and counting logicals

Two short items in this blog post. Since it was not obvious how to run odfWeave() in my particular setup, I show the call I am using. Then there were several people crosstabulating logical vectors, so I wanted to play along: 80 times faster than table().

odfWeave

My particular setup consists of R, 7-zip and LibreOffice. Somehow they don't 100% play along when using odfWeave. I had that problem this spring and decided to put my solution in a post at some point. In terms of versions, therefore: I had the problem with my previous versions, and tested that the solution still runs with my new setup (R 3.1.1, LibreOffice 4.2.5.2). The only loose end is that odfWeave complains I am re-using a directory, and that I need to empty said directory manually.
# the standard example call that works for me
demoFile <- system.file("examples", "simple.odt", package = "odfWeave")
outputFile <- gsub("simple.odt", "output.odt", demoFile)
odfWeave(demoFile, outputFile,
    workDir='C:\\Users\\Kees\\Documents\\tmp',
    odfWeaveControl(zipCmd = 
            c("C:\\Progra~1\\7-Zip\\7z a -tzip $$file$$ . -r", 
                "C:\\Progra~1\\7-Zip\\7z x -tzip $$file$$ -yr") ))
# removing files
file.remove(dir('C:\\Users\\Kees\\Documents\\tmp',
        recursive=TRUE,
        full.names=TRUE))

# using a different directory
odfWeave('C:\\Users\\Kees\\Documents\\test\\testcases.odt',
    'C:\\Users\\Kees\\Documents\\test\\testout.odt',
    workDir='C:\\Users\\Kees\\Documents\\tmp',
    odfWeaveControl(zipCmd = 
            c("C:\\Progra~1\\7-Zip\\7z a -tzip $$file$$ . -r", 
                "C:\\Progra~1\\7-Zip\\7z x -tzip $$file$$ -yr") ))

Cross table of logical vectors

This was started in Sometimes Table is not the Answer – a Faster 2×2 Table and carried on with Sometimes I feel (some) need for speed. So, I wanted to add my own attempts. The aim is to make a cross table of two logical vectors in a minimum of time, which becomes important if these vectors are long. The solutions from the previous posts:
set.seed(2014)

manual = sample(c(TRUE, FALSE), 10e6, replace = TRUE)
auto = sample(c(TRUE, FALSE), 10e6, replace = TRUE)

logical.tab = function(x, y) {
  tt = sum(x & y)
  tf = sum(x & !y)
  ft = sum(!x & y)
  ff = sum(!x & !y)
  return(matrix(c(ff, tf, ft, tt), 2, 2))
}

basic.tab2 = function(x, y) {
  dif = x - y
  tf = sum(dif > 0)
  ft = sum(dif < 0)
  tt = sum(x*y)
  ff = length(dif) - tt - tf - ft
  return(c(tf, ft, tt, ff))
}
tabulate(manual + auto *2+1, 4)

My idea was to use the margins and work back from there.
my.tab = function(x, y) {
  tt = sum(x * y)
  t1=sum(x)
  t2=sum(y)
  return(matrix(c(length(x)-t1-t2+tt,  t1-tt, t2-tt, tt), 2, 2))
}

my.tab2 <- function(x, y) {
  phase1 <- colSums(cbind(x,y,x*y))
  return(matrix(c(length(x)-sum(phase1[-3])+phase1[3],
     phase1[-3]-phase1[3],
     phase1[3]),2,2))
}
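Before benchmarking, a quick sanity check (my addition, not in the original post) that the margin-based version reproduces table() on a small example:
# small check that my.tab gives the same 2x2 counts as table()
x <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
y <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
my.tab(x, y)
table(x, y)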
With my particular hardware, table() is just too slow to microbenchmark many times, but 80 times faster than table() is not bad.
library(microbenchmark)
microbenchmark(
    logical.tab(manual, auto), 
    basic.tab2(manual, auto),
    my.tab(manual,auto),
    my.tab2(manual,auto),
    tabulate(manual + auto *2+1, 4),
    table(manual,auto),
    times = 20)
Unit: milliseconds
                               expr        min         lq     median         uq        max neval
          logical.tab(manual, auto)  2852.5587  2888.8590  2906.4571  2972.3916  3227.0821    20
           basic.tab2(manual, auto)   705.8153   722.5800   746.1683   765.9400   957.5435    20
               my.tab(manual, auto)   185.8359   186.6829   188.0988   224.2308   413.5623    20
              my.tab2(manual, auto)   463.2731   481.8843   487.7825   512.2563   694.1729    20
 tabulate(manual + auto * 2 + 1, 4)   276.1837   300.8009   315.9451   379.7302   534.7997    20
                table(manual, auto) 15703.0576 16132.0100 16231.3342 16466.7445 19012.0273    20

Sunday, July 6, 2014

Stone Flakes V, networks again

Last week I tried pcalg; this week, deal (Learning Bayesian Networks with Mixed Variables). In this post I want to try something new, a causal graphical model. The aim is just as much to get a feel for what these things do as to understand how the stone flakes data fit together.

Data

Data are the stone flakes data which I analyzed previously. The first post was about clustering, the second about linking to hominid type, the third about regression. Together these made up the bulk of a standard analysis. In this new analysis the same starting data are used.
r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')
r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')
r12 <- merge(r1,r2) 

r12$group <- factor(r12$group,labels=c('Lower Paleolithic',
        'Levallois technique',
        'Middle Paleolithic',
        'Homo Sapiens'))
r12$site <- factor(c('other','gravel pit')[r12$site+1])
r12$mat <- factor(c('flint','other')[r12$mat])
r12$lmage <- log10(-r12$age)

Deal

The starting point of this post was to continue my analysis of last week. But when I discovered that deal could be used to discover the model, repeating last week's analysis was chosen instead. Deal does not have a vignette, but there is a paper, deal: A Package for Learning Bayesian Networks, which helped me very well to get started.

First Model

Initially I wanted to start with a model containing only continuous variables, similar to before, but that threw an error in jointprior(). Hence I added group as a factor. autosearch() and heuristic() give a lot of output, basically one line for each step. For brevity these are not shown. The good thing about this model is that it has a solution where 'group' is driving other variables.
library(deal)
rfin <- subset(r12,,c(names(r2)[-1],'group'))
rfin <- rfin[complete.cases(rfin),]
rfin.nw <- network(rfin)
rfin.prior <- jointprior(rfin.nw)

Imaginary sample size: 8 
rfin.nw <- learn(rfin.nw,rfin,rfin.prior)$nw
rfin.search <- autosearch(rfin.nw,
    rfin,
    rfin.prior,
    trace=FALSE)

plot(rfin.search$nw)
heuristic() is used to further improve the model. In the end the model seems a bit more complex than the pcalg one, but not unreasonably so.
rfin.heuristic <- heuristic(rfin.search$nw,
    rfin,
    rfin.prior,
    restart=10,
    trace=FALSE,
    trylist=rfin.search$trylist)
plot(rfin.heuristic$nw)

Second model

For brevity I won't repeat the code. It is all the same except for the data going in, which will be shown. The second model is similar to the first, but the (potential) outliers have been removed. This model looks even cleaner.
rfin <- subset(r12,
    !(r12$ID %in% c('ms','c','roe','sz','va','arn')),
    c(names(r2)[-1],'group'))
rfin <- rfin[complete.cases(rfin),]

Third model

Including all sensible factors gives the model I really wanted to fit. However, the imaginary sample size then grew to 96. It is my experience that higher imaginary sample sizes produce more complex networks and longer run times, and the current model seems a bit too complex for my liking. Moving below the recommended imaginary sample size gave runtime errors.

Final model

In the final model (a few intermediate models are skipped now) a number of simplifications were made. Region is removed as a factor and log(-age) as a continuous variable. Group has lost Homo Sapiens, since that category had only three records. The model is restricted in the sense that group cannot be the result of other variables; in the plot these banned arrows are shown in red.
rfin <- subset(r12,
    !(r12$ID %in% c('ms','c','roe','sz','va','arn')),
    c(-ID,-number,-age,-dating,-region,-lmage))
rfin <- rfin[rfin$group !='Homo Sapiens',]
rfin <- rfin[complete.cases(rfin),]
rfin.nw <- network(rfin)
rfin.prior <- jointprior(rfin.nw)
mybanlist <- matrix(
    c(2:11,
        rep(1,10)),ncol=2)
banlist(rfin.nw) <- mybanlist
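# Note (my addition): the ban list rows are (from, to) pairs, so rows
# (2,1), (3,1), ..., (11,1) forbid arrows from every other node into node 1,
# which here should be the group variable; group can then only act as a parent.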

Conclusion

Deal makes networks that are too complex for my liking; pcalg cannot use discrete variables. Deal has banned links, a feature which helps. pcalg made nicer plots, but I have the feeling that is relatively easily remedied. Neither gave a model which struck me as one to continue with. I'll need quite a bit more study to feel comfortable with this kind of model.