Sunday, August 3, 2014

Guns are Cool - Differences between states

Last week my blog showed that there are differences between states in the shootingtracker database. This week it is attempted to understand why states are different. A number of variables were extracted from a few sources, among which gun laws, % gun owners and a few more general variables. It appeared that among the examined variables, % highschool or higher and % Gun Owners are most important. The latter has a markedly non-linear relation, worst amount of guns is around 30%.

Data

Shootingtracker data is same as last post.
Gun laws were difficult to capture. Laws are made full of exemptions and details, where statistics loves having variables with a few levels. Wikipedia has a table for each state, and the rules listed vary with each state. Luckily I also found a summary from Nick Leghorn containing 2011 laws in a nice tabular format. Some additional coding has been done to make this better usable. This coding categorized to Yes, No and Other.
Percentage guns is another difficult to find variable. A lucky find was on about.com where 2012 data could be found.
As additional variables, States IN Profile was used to extract income per capita and education information. For this a full dataset excel sheet was downloaded, after which data was selected into a much smaller version, by among others removing all but the latest year in each sheet, via editing in LibreOffice.
All in all there were quite a number of choices made to obtain the data. I feel quite certain different choices would bring about different analysis results.

Analysis

Three analysis methods were used. RandomForest because it allows easy mixing of variables and factors. I find it usually gives me a good idea which variables are of influence. Generalized linear regression, because it should be a default approach. Generalized Additive Model because it can handle non linearity in continuous variables without getting back to the standard polynomials. I do not really believe in a world with only linear relations, that is just too simple.

RandomForest

Random Forest shows that percentage High School (PerHSplus) is the most important variable, followed by the percentage gun owners (PercGunOwner). The gun laws are at the bottom of the list.

Generalized Linear Regression

Generalized linear regression showed again PercGunOwner and PerHSplus as most important variables. Though both have a p value of approximately 0.09, which is not strong.
Analysis of Deviance Table (Type II tests)

Response: cbind(x, Population - x)
LR Chisq Df Pr(>Chisq)
PermitPurchHandG  0.66719  1    0.41403
PermitPurchRifle  0.07659  1    0.78198
PermitPossess     1.32612  2    0.51527
Register          0.13776  2    0.93344
AssualtWBan       2.31216  2    0.31472
WaitingPeriod     0.16775  2    0.91955
Other             1.26820  1    0.26011
PercGunOwner      2.80036  1    0.09424 .
income            1.29039  1    0.25598
PerHSplus         2.77038  1    0.09602 .
PerBAplus         1.83276  1    0.17580
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The GAM package allows some model selection, which I decided to use. Only the continuous variables are used, but then the factors, which encompassed the gun laws, did not really show influence in previous analysis. It appears both PercGunOwner and PerHSplus have a non-linear relation with shootings.
Call:
gam(formula = cbind(x, Population - x) ~
s(PercGunOwner, df = 2) +
s(PerHSplus, df = 2),
family = binomial, data = r13, trace = FALSE)

Degrees of Freedom: 48 total; 43.99996 Residual
Residual Deviance: 85.09242

Plot

The figure shows a marked curvature in gun ownership. Worst is less education and medium level of gun ownership. Best is high education, where gun ownership does not matter very much.

Discussion

Getting three times the same variables is a good indication that, within the data used, these are the variables of relevance. However, it does not preclude that other variables are relevant.
It is odd that gun laws did not have influence. However, a number of explanations may be found. The variables may not express the laws essential properties. Gun laws may not be effective, at least not the range in laws within the USA. The relation may be different than 'laws reduce shootings', for instance: 'shootings cause laws'. To make that more explicit: Why would a law maker in Montana, with high % gun owners and no shootings make a gun law? It won't get voters. On the other hand, the amount of shootings last week's post had in DC shootings indicates something needs to be done.
A low level of education having more shootings makes some kind of sense. It might be a direct relation, but just as probable an indicator for a whole number of related variables.
The relation with ownership is interesting. Over some percentage of ownership shootings decrease. Whether there is a common cultural factor influencing both or a direct causation cannot be stated based on these data. The relation less guns gives less shootings under 25% ownership is at least a bit more intuitively logical.

Code

Reading the shootingtracker data has not changed since last week.
r1 <- c(r13,r14)
r2 <- gsub('\\[[a-zA-Z0-9]*\\]','',r1)
r3 <- gsub('^ *\$','',r2)
r4 <- r3[r3!='']
r5 <- gsub('\\t\$','',r4)
r6 <- gsub('\\t References\$','',r5)
sep='\t',
stringsAsFactors=FALSE)
r7\$Location[r7\$Location=='Washington DC'] <-
'WashingtonDC, DC'
sep=',',
col.names=c('Location','State'),
stringsAsFactors=FALSE)
r8\$State <- gsub(' ','',r8\$State)
r8\$State[r8\$State=='Tennessee'] <- 'TN'
r8\$State[r8\$State=='Ohio'] <- 'OH'
r8\$State[r8\$State %in% c('Kansas','KA')] <- 'KS'
r8\$State[r8\$State=='Louisiana'] <- 'LA'
r8\$State[r8\$State=='Illinois'] <- 'IL'
r8\$State <- toupper(r8\$State)
table(r8\$State)
r7\$StateAbb <- r8\$State
r7\$Location <- r8\$Location
r7 <- r7[! (r7\$State %in% c( 'PUERTORICO','PR')),]
r7\$Date <- gsub('/13\$','/2013',r7\$Date)
r7\$date <- as.Date(r7\$Date,format="%m/%d/%Y")

r7\$one <- 1
agg1 <- with(r7,
aggregate(one,by=list(StateAbb=StateAbb),FUN=sum))
library(datasets)
states <- data.frame(StateAbb=as.character(state.abb),
StateArea=state.area,
State=as.character(state.name),
StateRegion=state.region,
Frost=state.x77[,'Frost']
)
# http://www.census.gov/popest/data/state/totals/2013/index.html
#put it all together
states <- rbind(states,data.frame(StateAbb='DC',
StateArea=68.3,
State='District of Columbia',
StateRegion= 'South',Frost=NA))
r9 <- merge(states,inhabitants)
r10 <- merge(r9,agg1,all=TRUE)
# No data in some states
r10\$x[is.na(r10\$x)] <- 0

r10\$incidence <- 100000*r10\$x/r10\$Population*365/
as.numeric((max(r7\$date)-min(r7\$date)))

Gun laws

library(XML)
stringsAsFactors = FALSE)
laws <- rh[[1]]
names(laws) <- c('State',
'PermitPurchHandG',
'PermitPurchRifle',
'PermitPossess',
'Register',
'AssualtWBan',
'WaitingPeriod',
'Other')
for (i in 2:(ncol(laws)-1)) {
laws[!(laws[,i] %in% c('Yes','No')),i] <- 'Other'
laws[,i] <- factor(laws[,i],levels=c('Yes','No','Other'))
}
laws[!(laws[,8] %in% c('None')),8] <- 'Other'
laws[,8] <- factor(laws[,8],levels=c('None','Other'))
r11 <- merge(r10,laws)

Gun ownership

lines1 <- gsub('\\(.*\\)','',lines1)
lines1 <- gsub('^[0-9]*. ','',lines1)
lines1 <- gsub('% *\$','',lines1)
lines1 <- gsub(' - ',',',lines1,fixed=TRUE)
r12 <- merge(r11,

General/states in profile

library(xlsx)
sheetName='Ed Attainment (ACS 2008)')
sheet1 <- sheet1[sheet1\$state != 'US',]
sheet1x <- data.frame(StateAbb=sheet1\$state,
PerHSplus = as.numeric(as.character(sheet1\$X._Receiving_HS_Diploma_or_more)),
PerBAplus = as.numeric(as.character(sheet1\$Total_Completing_BA_Degree_or_More)))
sheetName='BEA Per Capita Income')
sheet2 <- sheet2[sheet2\$stabb != 'US',]
sheet2x <- data.frame(StateAbb=factor(sheet2\$stabb),
income=as.numeric(as.character(sheet2\$data)))
sheets <- merge(merge(sheet1x,sheet2x),sheet1x)
r13 <- merge(r12,sheets)

Analysis

Random forest

library(randomForest)
rf1 <- randomForest(incidence ~
PermitPurchHandG + PermitPurchRifle +
PermitPossess + Register +
AssualtWBan + WaitingPeriod +
Other +
PercGunOwner +
income + PerHSplus + PerBAplus,
data=r13,
importance= TRUE)
varImpPlot(rf1)

GLM

library(car)
g1 <- glm(cbind(x,Population-x) ~
PermitPurchHandG + PermitPurchRifle +
PermitPossess + Register +
AssualtWBan + WaitingPeriod +
Other +
PercGunOwner +
income + PerHSplus + PerBAplus,
data=r13,
family=binomial)
Anova(g1)

GAM

g.obj <- gam(cbind(x,Population-x) ~
s(PercGunOwner,df=2) +
s(income,df=2) +
s(PerHSplus,df=2) +
s(PerBAplus,df=2),
data=r13,
family=binomial)
g.step <- step.gam(g.obj,scope=list(
"PercGunOwner"=~1  + PercGunOwner + s(PercGunOwner,df=2) + s(PercGunOwner,df=3),
"income"      =~1  + income       + s(income,df=2)       + s(income,df=3),
"PerHSplus"   =~1  + PerHSplus    + s(PerHSplus,df=2)    + s(PerHSplus,df=3),
"PerBAplus"   =~1  + PerBAplus    + s(PerBAplus,df=2)    + s(PerBAplus,df=3))
)

g.step

Plot

series <- function(x,n) seq(min(x),max(x),length.out=n)
topred <- expand.grid(
PercGunOwner=series(r13\$PercGunOwner,11),
PerHSplus=series(r13\$PerHSplus,12))
preds <- predict(g.step,topred,type='response')
preds <- preds*100000*365/
as.numeric(max(r7\$date)-min(r7\$date))
png('contour1.png')
contour(x=series(r13\$PercGunOwner,11),
y=series(r13\$PerHSplus,12),
z=preds,
xlab='% Gun Owner',
ylab='% High School Plus')
text(x=r13\$PercGunOwner,
y=r13\$PerHSplus,
labels=paste(r13\$StateAbb,format(r13\$incidence,digits=1)),
cex=.65)
dev.off()