Saturday, July 26, 2014

Guns are Cool - States

Last week I looked at time effects in the shootingtracker database. This week I look at the states. Some (smaller) states never made it into the database; other states appear far too frequently, the worst of these being California. After correcting for population size, Washington DC, with 0.8 shootings per year per 100000 inhabitants, sticks out. Subsequently, the observed incidences have been shrunk towards the common distribution. In these posterior estimates Massachusetts is best, while Washington DC, Illinois and Louisiana are worst.

Data

Data are from the shootingtracker database and were updated July 25; the last entry in the database is also from July 25. The script is a slightly adapted version of last week's. Most notably, I did not notice last week that the database uses KA for Kansas, whereas KS is what my other data uses. Note that, in order to make this post readable, all code is in a separate section at the end.

Adding state information

The first step is to get one record per state. The datasets package has the state information, while the number of inhabitants (2013 data) is plucked from census.gov. After merging, some states have missing counts for shootings; absence from the shootingtracker database is taken as 0 shootings.
Incidence is calculated per 100 000 inhabitants per year (365 days).
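As a minimal numerical illustration of that calculation (the numbers below are hypothetical and not from the database; the real computation on the merged data is in the code section):

count <- 20          # hypothetical shootings in the observation window
population <- 5e6    # hypothetical number of inhabitants
days <- 570          # hypothetical length of the window in days
100000 * count / population * 365 / days   # incidence, roughly 0.26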

Calculating posteriors

Assuming a beta prior and a binomial distribution for the number of shootings, Jags is used to estimate the posteriors. The one notable thing in the model is the prior for b: b really wants to be high, hence the prior is quite a bit wider than usual. The current values seem to be acceptable; see also the output at the bottom of this post.
The plot shows a number of things. In red are the observed values; in black the posterior values, as means and 90% intervals of the samples. The blue line is the value towards which the data are shrunk. Washington DC is shrunk quite a bit, because it has relatively few people and a large observed value; the same is true for Delaware. Illinois, on the other hand, has many inhabitants, hence is shrunk much less and also gets a narrower interval. On the other side, states like Wyoming are shrunk upwards. Again, more inhabitants means less shrinking; Massachusetts ends up with the lowest posterior incidence. Really big states such as New York, Texas and California are not shrunk at all and have quite narrow intervals: Texas and New York on the good side, California on the bad side.
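The amount of shrinking follows directly from the beta-binomial structure: conditional on a and b, a state's rate p[i] has a Beta(count + a, Population - count + b) posterior with mean (count + a)/(Population + a + b), so the prior carries a weight of (a + b) relative to the population. A minimal sketch of this conditional posterior mean on the incidence scale, using plug-in values close to the posterior means of a and b reported at the bottom of this post (the model itself averages over their uncertainty, so this is only an approximation):

# Sketch: conditional posterior mean of a state's rate, on the incidence
# scale (per 100 000 inhabitants per year). Plug-in values a=9.7 and b=6.2e6
# are close to the posterior means in the Jags output below.
shrunk <- function(count, population, days, a=9.7, b=6.2e6) {
  scale <- 100000*365/days
  scale * (count + a) / (population + a + b)
}
# Hypothetical counts and an illustrative window of 570 days: a small state
# is pulled most of the way towards scale*a/(a+b) (about 0.10, the betamean
# in the output), while a large state with the same observed incidence
# moves far less.
shrunk(count=5,   population=6e5,  days=570)
shrunk(count=320, population=38e6, days=570)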



The map shows the posteriors in a different way. At this point it might be worth noting that the northeast is actually doing quite well, but since these states have so few inhabitants they are shrunk upwards and look bad. The current approach does not provide a natural way to avoid this. In a year or two, additional data may cause at least some change in the posteriors of the smaller states, both at the top and the bottom of this list.

Code

reading data

# raw 2013 and 2014 shootingtracker pages saved as text; remove footnote
# markers such as [1], blank lines, trailing tabs and 'References' labels
r13 <- readLines('raw13.txt')
r14 <- readLines('raw14.txt')
r1 <- c(r13,r14)
r2 <- gsub('\\[[a-zA-Z0-9]*\\]','',r1)
r3 <- gsub('^ *$','',r2)
r4 <- r3[r3!='']
r5 <- gsub('\\t$','',r4)
r6 <- gsub('\\t References$','',r5)
r7 <- read.table(textConnection(r6),
    sep='\t',
    header=TRUE,
    stringsAsFactors=FALSE)
# Washington DC lacks a state part; add one so the split on ',' below works
r7$Location[r7$Location=='Washington DC'] <-
    'WashingtonDC, DC'
r8 <- read.table(textConnection(as.character(r7$Location)),
    sep=',',
    col.names=c('Location','State'),
    stringsAsFactors=FALSE)
r8$State <- gsub(' ','',r8$State)
# normalize states that are written out in full or use non-standard codes
r8$State[r8$State=='Tennessee'] <- 'TN'
r8$State[r8$State=='Ohio'] <- 'OH'
r8$State[r8$State %in% c('Kansas','KA')] <- 'KS'
r8$State[r8$State=='Louisiana'] <- 'LA'
r8$State[r8$State=='Illinois'] <- 'IL'
r8$State <- toupper(r8$State)
r7$StateAbb <- r8$State
r7$Location <- r8$Location
# drop Puerto Rico; it is not in the state data used below
r7 <- r7[!(r7$StateAbb %in% c('PUERTORICO','PR')),]
# harmonize two-digit years and convert to Date
r7$Date <- gsub('/13$','/2013',r7$Date)
r7$date <- as.Date(r7$Date,format="%m/%d/%Y")

state information

# count shootings per state
r7$one <- 1
agg1 <- with(r7,
    aggregate(one,by=list(StateAbb=StateAbb),FUN=sum))
library(datasets)
states <- data.frame(StateAbb=as.character(state.abb),
    StateArea=state.area,
    State=as.character(state.name),
    StateRegion=state.region,
    Frost=state.x77[,'Frost']
)
# http://www.census.gov/popest/data/state/totals/2013/index.html
# data pre processed so the csv can be used as is.
inhabitants <- read.csv('NST-EST2013-01.treated.csv')
#put it all together
states <- rbind(states,data.frame(StateAbb='DC',
        StateArea=68.3,
        State='District of Columbia',
       StateRegion= 'South',Frost=NA))
r9 <- merge(states,inhabitants)
r10 <- merge(r9,agg1,all=TRUE)
# No data in some states
r10$x[is.na(r10$x)] <- 0
r10$incidence <- 100000*r10$x/r10$Population*365/
    as.numeric(max(r7$date)-min(r7$date))

Jags model


library(R2jags)
datain <- list(
    count=r10$x,
    Population = r10$Population,
    n=nrow(r10),
    scale=100000*365/
        as.numeric((max(r7$date)-min(r7$date))))

model1 <- function() {
  for (i in 1:n) {
    count[i] ~ dbin(p[i],Population[i])
    p[i] ~ dbeta(a,b)
    pp[i] <- p[i]*scale
  }
  # truncated normal hyperpriors; the %_% trick keeps T(0,) out of the R
  # parser but writes it into the Jags model file
  a ~ dnorm(0,.000001) %_% T(0,)
  b ~ dnorm(1e7,.00000000000001) %_% T(0,)
  # common mean on the incidence scale
  betamean <- scale * a/(a+b)
}

parameters <- c('a','b','betamean','pp')
jagsfit <- jags(datain,
    model.file=model1,
    parameters.to.save=parameters,
    n.iter=20000,
    n.burnin=1000,
    n.chains=4,
    progress.bar="gui")
jagsfit
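Not part of the original script, but a quick visual impression of mixing for the hyperparameters a and b (the slowest-mixing parameters here, with Rhat around 1.07 in the output below) can be had from the stored samples; a minimal sketch:

# Optional convergence check (sketch):
# sims.array holds the retained samples as iterations x chains x parameters
sims <- jagsfit$BUGSoutput$sims.array
par(mfrow=c(2,1))
matplot(sims[,,'a'], type='l', lty=1, xlab='iteration', ylab='a')
matplot(sims[,,'b'], type='l', lty=1, xlab='iteration', ylab='b')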

Plot 1

# columns 5:55 of sims.matrix hold pp[1] to pp[51]
samples <- jagsfit$BUGSoutput$sims.matrix[,5:55]
r10$pred <- apply(samples,2,mean)
r10$l05 <- apply(samples,2,function(x) quantile(x,.05))
r10$l95 <- apply(samples,2,function(x) quantile(x,.95))
library(ggplot2)
r11 <- subset(r10,,c(State,l05,l95,pred,incidence))
r11 <- r11[order(r11$pred),]
r11$State <- factor(r11$State,levels=r11$State)
p <- ggplot(r11, aes(y=pred, x=State))
limits <- aes(ymin=l05, ymax=l95)
dodge <- position_dodge(width=0.9)
p + geom_point(position=dodge, stat="identity") +
    geom_errorbar(limits, position=dodge, width=0.25) +
    theme(legend.position = "bottom") +
    geom_point(aes(y=incidence,x=State),
        position=dodge, stat="identity",col='red') +
    ylab('Incidence') +
    # blue line: the common level (betamean) towards which estimates shrink
    geom_hline(yintercept=mean(jagsfit$BUGSoutput$sims.matrix[,3]),col='blue') +
    coord_flip()

Plot 2

# http://stackoverflow.com/questions/20434343/color-us-states-according-to-value-in-r
# posterior means rescaled to [0,1] for the color ramp
library(maps)
values <- (r10$pred-min(r10$pred))/(max(r10$pred)-min(r10$pred))
#values <- (r10$incidence-min(r10$incidence))/
#    (max(r10$incidence[r10$incidence<.5])-min(r10$incidence))
names(values) <- r10$StateAbb
# getting the names used by map
tmp <- map('state',plot=FALSE,namesonly=TRUE)
# matching (after adjusting using gsub and tolower)
tmp <- match(gsub('(:.*)','',tmp),tolower(r10$State))
map('state',
    fill=TRUE,
    col=rgb(colorRamp(c('green','red'))(values),max=255)[tmp])

Jags output

Inference for Bugs model at "C:/Users/Kees/AppData/Local/Temp/RtmpwbAFvR/modela3419a366f0.txt", fit using jags,
 4 chains, each with 20000 iterations (first 1000 discarded), n.thin = 19
 n.sims = 4000 iterations saved
             mu.vect     sd.vect        2.5%         25%         50%
a              9.732       6.530       2.931       5.662       7.924
b        6207140.792 4125444.443 1898003.649 3609272.959 5094169.482
betamean       0.101       0.008       0.087       0.096       0.101
pp[1]          0.087       0.032       0.032       0.064       0.085
pp[2]          0.123       0.030       0.073       0.103       0.120
pp[3]          0.071       0.026       0.026       0.053       0.070
pp[4]          0.099       0.024       0.058       0.083       0.098
pp[5]          0.130       0.014       0.105       0.120       0.130
pp[6]          0.081       0.023       0.041       0.065       0.079
pp[7]          0.097       0.027       0.050       0.078       0.094
pp[8]          0.130       0.041       0.066       0.101       0.123
pp[9]          0.110       0.017       0.079       0.097       0.109
pp[10]         0.094       0.020       0.059       0.081       0.093
pp[11]         0.079       0.030       0.027       0.058       0.077
pp[12]         0.071       0.026       0.026       0.052       0.069
pp[13]         0.076       0.029       0.024       0.055       0.074
pp[14]         0.177       0.029       0.127       0.158       0.176
pp[15]         0.122       0.027       0.077       0.103       0.120
pp[16]         0.122       0.033       0.069       0.099       0.118
pp[17]         0.109       0.028       0.062       0.089       0.106
pp[18]         0.159       0.037       0.100       0.132       0.154
pp[19]         0.056       0.020       0.022       0.041       0.054
pp[20]         0.087       0.023       0.048       0.071       0.086
pp[21]         0.090       0.031       0.035       0.068       0.087
pp[22]         0.124       0.024       0.083       0.107       0.122
pp[23]         0.080       0.023       0.040       0.063       0.078
pp[24]         0.127       0.028       0.080       0.108       0.124
pp[25]         0.079       0.026       0.032       0.061       0.078
pp[26]         0.083       0.031       0.029       0.062       0.081
pp[27]         0.121       0.023       0.080       0.105       0.119
pp[28]         0.087       0.032       0.029       0.065       0.086
pp[29]         0.092       0.030       0.041       0.072       0.090
pp[30]         0.080       0.031       0.027       0.057       0.078
pp[31]         0.106       0.022       0.068       0.090       0.104
pp[32]         0.108       0.032       0.055       0.086       0.104
pp[33]         0.139       0.037       0.081       0.113       0.135
pp[34]         0.076       0.014       0.051       0.066       0.075
pp[35]         0.097       0.019       0.063       0.084       0.096
pp[36]         0.123       0.031       0.074       0.101       0.120
pp[37]         0.071       0.024       0.029       0.054       0.070
pp[38]         0.094       0.018       0.062       0.081       0.093
pp[39]         0.106       0.035       0.049       0.082       0.102
pp[40]         0.110       0.027       0.063       0.091       0.108
pp[41]         0.086       0.032       0.030       0.064       0.084
pp[42]         0.122       0.026       0.078       0.103       0.119
pp[43]         0.078       0.013       0.055       0.069       0.077
pp[44]         0.081       0.027       0.034       0.062       0.080
pp[45]         0.111       0.023       0.070       0.094       0.109
pp[46]         0.089       0.034       0.033       0.066       0.086
pp[47]         0.081       0.021       0.043       0.066       0.080
pp[48]         0.065       0.021       0.029       0.050       0.064
pp[49]         0.103       0.032       0.051       0.080       0.100
pp[50]         0.090       0.034       0.032       0.066       0.087
pp[51]         0.185       0.063       0.095       0.139       0.174
deviance     242.511      13.874     217.150     232.666     241.477
                 75%        97.5%  Rhat n.eff
a             11.529       28.847 1.068    63
b        7337119.577 18410168.237 1.067    64
betamean       0.106        0.117 1.001  4000
pp[1]          0.106        0.158 1.003   910
pp[2]          0.140        0.190 1.001  2600
pp[3]          0.088        0.125 1.006   460
pp[4]          0.114        0.152 1.001  4000
pp[5]          0.139        0.159 1.002  1800
pp[6]          0.096        0.130 1.005   560
pp[7]          0.112        0.157 1.001  4000
pp[8]          0.151        0.230 1.001  3000
pp[9]          0.120        0.147 1.001  2800
pp[10]         0.107        0.137 1.003  1100
pp[11]         0.096        0.143 1.007   450
pp[12]         0.087        0.126 1.005   550
pp[13]         0.094        0.137 1.007   420
pp[14]         0.196        0.238 1.012   250
pp[15]         0.138        0.182 1.001  4000
pp[16]         0.140        0.200 1.001  2900
pp[17]         0.124        0.171 1.001  4000
pp[18]         0.182        0.242 1.008   350
pp[19]         0.069        0.098 1.008   320
pp[20]         0.102        0.137 1.002  2000
pp[21]         0.108        0.158 1.003  1200
pp[22]         0.139        0.178 1.003   970
pp[23]         0.094        0.128 1.002  1800
pp[24]         0.144        0.190 1.004   660
pp[25]         0.096        0.135 1.006   550
pp[26]         0.102        0.151 1.003  3700
pp[27]         0.135        0.170 1.002  1800
pp[28]         0.106        0.159 1.004  1200
pp[29]         0.110        0.160 1.004  1200
pp[30]         0.099        0.145 1.005   520
pp[31]         0.120        0.154 1.002  1400
pp[32]         0.127        0.179 1.003  1800
pp[33]         0.161        0.224 1.006   450
pp[34]         0.085        0.105 1.006   480
pp[35]         0.108        0.138 1.001  3600
pp[36]         0.141        0.191 1.002  1300
pp[37]         0.087        0.123 1.004   700
pp[38]         0.106        0.133 1.001  4000
pp[39]         0.125        0.184 1.002  4000
pp[40]         0.126        0.171 1.002  4000
pp[41]         0.106        0.154 1.003  1200
pp[42]         0.138        0.180 1.006   470
pp[43]         0.086        0.106 1.006   480
pp[44]         0.098        0.143 1.005   590
pp[45]         0.125        0.161 1.002  2500
pp[46]         0.109        0.164 1.004   700
pp[47]         0.094        0.126 1.004   700
pp[48]         0.080        0.110 1.008   320
pp[49]         0.120        0.173 1.004   790
pp[50]         0.109        0.166 1.003  1400
pp[51]         0.217        0.337 1.012   210
deviance     251.840      271.487 1.039    88

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 92.8 and DIC = 335.3
DIC is an estimate of expected predictive error (lower deviance is better).
