Data
Tycho data
This follows the same pattern as last week.r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
na.strings='-',
skip=2)
r2 <- reshape(r1,
varying=names(r1)[-c(1,2)],
v.names='Cases',
idvar=c('YEAR' , 'WEEK'),
times=names(r1)[-c(1,2)],
timevar='STATE',
direction='long')
r2$STATE=factor(r2$STATE)
Population
The absolute counts are less of an interest, relevant is per 1000 or 100 000 inhabitants, hence population data are needed. Luckily these can be found at census.gov. Less lucky, they sit in text files per decade and the decades have slightly different layouts. On top of that, rather than printing all columns next to each other it is two sections. Based on my investigation last week, years 1920 to 1970 are retrieved. Please notice the units are thousands of inhabitants.years <- dir(pattern='+.txt')
pop1 <-
lapply(years,function(x) {
rl <- readLines(x)
print(x)
sp <- grep('^U.S.',rl)
st1 <- grep('^AL',rl)
st2 <- grep('^WY',rl)
rl1 <- rl[c(sp[1]-2,st1[1]:st2[1])]
rl2 <- rl[c(sp[2]-2,st1[2]:st2[2])]
read1 <- function(rlx) {
rlx[1] <- paste('abb',rlx[1])
rlx <- gsub(',','',rlx,fixed=TRUE)
rt <- read.table(textConnection(rlx),header=TRUE)
rt[,grep('census',names(rt),invert=TRUE)]
}
rr <- merge(read1(rl1),read1(rl2))
ll <- reshape(rr,
list(names(rr)[-1]),
v.names='pop',
timevar='YEAR',
idvar='abb',
times=as.integer(gsub('X','',names(rr)[-1])),
direction='long')
})
[1] "st2029ts.txt"
[1] "st3039ts.txt"
[1] "st4049ts.txt"
[1] "st5060ts.txt"
[1] "st6070ts.txt"
pop <- do.call(rbind,pop1)
Glue between datasets
It is not shown in the printout above, but the census data contains states as 2 character abbreviations, while the Tycho data has states as uppercase texts with spaces replaced by dots, because they started out as variables. Some glue is needed. This can be done via the states data in the datasets package. DC is added manually, since it was not in the states data, but was present in both source data sets. Incidence is Cases/pop, and has as units cases per 1000 inhabitants per week. Variable STATE has served its purpose, so is removed.states <- rbind(
data.frame(
abb=datasets::state.abb,
State=datasets::state.name),
data.frame(abb='DC',
State='District of Columbia'))
states$STATE=gsub(' ','.',toupper(states$State))
r3 <- merge(r2,states)
r4 <- merge(r3,pop)
r4$incidence <- r4$Cases/r4$pop
r5 <- subset(r4,r4$YEAR>1927,-STATE)
head(r5)
YEAR abb WEEK Cases State pop incidence
20434 1928 AL 44 6 Alabama 2640 0.002272727
20435 1928 AL 27 45 Alabama 2640 0.017045455
20436 1928 AL 47 22 Alabama 2640 0.008333333
20437 1928 AL 26 106 Alabama 2640 0.040151515
20438 1928 AL 30 66 Alabama 2640 0.025000000
20439 1928 AL 18 251 Alabama 2640 0.095075758
Those who think R has a memory problem may want to delete some data here (r2, r3), see end of post. On the other hand, the objects are not that large.
Plots
Data are now on comparable scales, so I tried boxplots. By 1966 it seems measles was under control.library(ggplot2)
ggplot(with(r5,aggregate(incidence,list(Year=YEAR,State=State),
function(x) mean(x,na.rm=TRUE))),
aes(factor(Year), x)) +
geom_boxplot(notch = TRUE) +
ylab('Incidence registered Measles Cases per week per 1000') +
theme(text=element_text(family='Arial')) +
scale_x_discrete(labels=
ifelse(levels(factor(r5$YEAR)) %in%
seq(1920,1970,by=10),
levels(factor(r5$YEAR)),
''))
The pattern within the years is still clearly visible. A slow increase from week 38 to week 20 of the next year. Then a fast decrease in the remaining 18 weeks.
ggplot(r5[r5$YEAR<1963,],
aes(factor(WEEK), incidence)) +
ylab('Incidence registered Measles Cases per week per 1000') +
ggtitle('Measles 1928-1962')+
theme(text=element_text(family='Arial')) +
geom_boxplot(notch = TRUE) +
scale_x_discrete(labels=
ifelse(levels(factor(r5$WEEK)) %in%
seq(5,55,by=5),
levels(factor(r5$WEEK)),
'')) +
scale_y_log10()
References
Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.Deleting data?
These are not the smallest objects;#https://gist.github.com/jedifran/1447362
.ls.objects <- function (pos = 1, pattern, order.by,
decreasing=FALSE, head=FALSE, n=5) {
napply <- function(names, fn) sapply(names, function(x)
fn(get(x, pos = pos)))
names <- ls(pos = pos, pattern = pattern)
obj.class <- napply(names, function(x) as.character(class(x))[1])
obj.mode <- napply(names, mode)
obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
obj.prettysize <- napply(names, function(x) {
capture.output(print(object.size(x), units = "auto")) })
obj.size <- napply(names, object.size)
obj.dim <- t(napply(names, function(x)
as.numeric(dim(x))[1:2]))
vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
obj.dim[vec, 1] <- napply(names, length)[vec]
out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
if (!missing(order.by))
out <- out[order(out[[order.by]], decreasing=decreasing), ]
if (head)
out <- head(out, n)
out
}
# shorthand
lsos <- function(..., n=10) {
.ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}
lsos()
Type Size PrettySize Rows Columns
r2 data.frame 20879368 19.9 Mb 231660 4
r4 data.frame 4880608 4.7 Mb 135233 8
r3 data.frame 4737864 4.5 Mb 196911 6
r5 data.frame 4140816 3.9 Mb 114800 7
r1 data.frame 965152 942.5 Kb 3861 62
pop1 list 204752 200 Kb 5 NA
pop data.frame 182312 178 Kb 2592 3
states data.frame 11144 10.9 Kb 51 3
lsos function 5400 5.3 Kb NA NA
years character 368 368 bytes 5 NA
But how large is 20 Mb these days? I have 4 Gb of RAM. At the end of making this post 1.5 Gb is used.
No comments:
Post a Comment