Sunday, April 6, 2014

Looking at Measles Data in Project Tycho, part II

Continuing from last week, I will now look at incidence rates of measles in the US. To recap, Project Tycho contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.

Data

Tycho data

This follows the same pattern as last week.
# weekly counts per state; '-' codes missing values, the first two lines of the file are skipped
r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
    na.strings='-',
    skip=2)
# reshape from wide (one column per state) to long (one row per YEAR/WEEK/STATE)
r2 <- reshape(r1,
    varying=names(r1)[-c(1,2)],
    v.names='Cases',
    idvar=c('YEAR','WEEK'),
    times=names(r1)[-c(1,2)],
    timevar='STATE',
    direction='long')
r2$STATE <- factor(r2$STATE)
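
As a quick sanity check (a sketch, output not shown): r1 has 62 columns, of which 60 are state columns, so the long format should end up with 3861 * 60 = 231660 rows, which matches the object sizes listed at the end of this post.
# each of the 60 state columns contributes its own block of rows,
# so nrow(r2) should equal nrow(r1) * 60
dim(r1)
dim(r2)
nlevels(r2$STATE)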

Population

The absolute counts are of less interest; what is relevant is the rate per 1000 or 100 000 inhabitants, hence population data are needed. Luckily these can be found at census.gov. Less luckily, they sit in one text file per decade, and the decades have slightly different layouts. On top of that, rather than printing all columns next to each other, each file splits the states over two sections. Based on my investigation last week, the years 1920 to 1970 are retrieved. Please note the units are thousands of inhabitants.
years <- dir(pattern='\\.txt$')   # one census file per decade
pop1 <-
    lapply(years,function(x) {
            rl <- readLines(x)
            print(x)   # show which file is being processed
            # each file lists the states twice, in two column sections;
            # the column-header line sits two lines above the 'U.S.' total line,
            # and the state rows run from AL to WY
            sp <- grep('^U.S.',rl)
            st1 <- grep('^AL',rl)
            st2 <- grep('^WY',rl)
            rl1 <- rl[c(sp[1]-2,st1[1]:st2[1])]
            rl2 <- rl[c(sp[2]-2,st1[2]:st2[2])]

            # parse one section: label the first column 'abb', drop the
            # thousands separators and any census columns
            read1 <- function(rlx) {
                rlx[1] <- paste('abb',rlx[1])
                rlx <- gsub(',','',rlx,fixed=TRUE)
                rt <- read.table(textConnection(rlx),header=TRUE)
                rt[,grep('census',names(rt),invert=TRUE)]
            }
            rr <- merge(read1(rl1),read1(rl2))
            # to long format: one row per state abbreviation and year
            reshape(rr,
                list(names(rr)[-1]),
                v.names='pop',
                timevar='YEAR',
                idvar='abb',
                times=as.integer(gsub('X','',names(rr)[-1])),
                direction='long')
        })

[1] "st2029ts.txt"
[1] "st3039ts.txt"
[1] "st4049ts.txt"
[1] "st5060ts.txt"
[1] "st6070ts.txt"

pop <- do.call(rbind,pop1)
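
A quick look at the combined population table (a sketch, output not shown); it should hold one row per state abbreviation per year, with pop in thousands of inhabitants.
head(pop)
range(pop$YEAR)          # years covered
with(pop, table(YEAR))   # number of states available per year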

Glue between datasets

It is not shown in the printout above, but the census data identifies states by two-character abbreviations, while the Tycho data has states as uppercase names with spaces replaced by dots, because they started out as column names. Some glue is needed. This can be done via the state data in the datasets package. DC is added manually, since it is not in the datasets state data but is present in both source data sets. Incidence is Cases/pop, with units of cases per 1000 inhabitants per week. The variable STATE has served its purpose, so it is removed.
states <- rbind(
    data.frame(
        abb=datasets::state.abb,
        State=datasets::state.name),
    data.frame(abb='DC',
        State='District of Columbia'))
# the dotted uppercase form matches the column names used in the Tycho file
states$STATE <- gsub(' ','.',toupper(states$State))

r3 <- merge(r2,states)            # adds abb and proper state name, by STATE
r4 <- merge(r3,pop)               # adds population (thousands), by abb and YEAR
r4$incidence <- r4$Cases/r4$pop   # cases per 1000 inhabitants per week
r5 <- subset(r4,r4$YEAR>1927,-STATE)
head(r5)

      YEAR abb WEEK Cases   State  pop   incidence
20434 1928  AL   44     6 Alabama 2640 0.002272727
20435 1928  AL   27    45 Alabama 2640 0.017045455
20436 1928  AL   47    22 Alabama 2640 0.008333333
20437 1928  AL   26   106 Alabama 2640 0.040151515
20438 1928  AL   30    66 Alabama 2640 0.025000000
20439 1928  AL   18   251 Alabama 2640 0.095075758
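
Note that merge() with its default all=FALSE keeps only matching rows, which is why r3 has fewer rows than r2 in the object listing at the end of this post. Which Tycho columns found no counterpart in the states table can be checked (a sketch, output not shown):
# Tycho state columns without a matching state abbreviation
setdiff(levels(r2$STATE), states$STATE)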

Those who think R has a memory problem may want to delete some data here (r2, r3); see the end of this post. On the other hand, the objects are not that large.
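
For those who do want to free the intermediate objects, a minimal sketch (not run here; r2 and r3 still appear in the object listing at the end of this post):
rm(r2, r3)   # drop the intermediate data frames
gc()         # collect garbage and report memory use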

Plots

Data are now on comparable scales, so I tried boxplots. By 1966 it seems measles was under control.

library(ggplot2)
# mean weekly incidence per state and year, then a boxplot over states for each year
ggplot(with(r5,aggregate(incidence,list(Year=YEAR,State=State),
                function(x) mean(x,na.rm=TRUE))),
        aes(factor(Year), x)) +
    geom_boxplot(notch = TRUE) +
    ylab('Incidence registered Measles Cases per week per 1000') +
    theme(text=element_text(family='Arial')) +
    # label only every tenth year on the x axis
    scale_x_discrete(labels=
            ifelse(levels(factor(r5$YEAR)) %in%
                    seq(1920,1970,by=10),
                levels(factor(r5$YEAR)),
                ''))
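
To back this up with numbers rather than a plot, the yearly means can be computed directly (a sketch, output not shown):
# mean weekly incidence per 1000 inhabitants by year, 1960 onwards
aggregate(incidence ~ YEAR, data=subset(r5, YEAR>=1960), FUN=mean, na.rm=TRUE)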

The pattern within the years is still clearly visible: a slow increase from week 38 to week 20 of the next year, then a fast decrease in the remaining 18 weeks.

# weekly pattern, restricted to the years 1928-1962
ggplot(r5[r5$YEAR<1963,],
        aes(factor(WEEK), incidence)) +
    ylab('Incidence registered Measles Cases per week per 1000') +
    ggtitle('Measles 1928-1962') +
    theme(text=element_text(family='Arial')) +
    geom_boxplot(notch = TRUE) +
    # label only every fifth week on the x axis
    scale_x_discrete(labels=
            ifelse(levels(factor(r5$WEEK)) %in%
                    seq(5,55,by=5),
                levels(factor(r5$WEEK)),
                '')) +
    scale_y_log10()
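
The same cycle can be pinned down numerically, for instance by locating the week with the highest median incidence (a sketch, output not shown):
# median incidence per week number, over all states and the years before 1963
wk <- aggregate(incidence ~ WEEK, data=subset(r5, YEAR<1963), FUN=median, na.rm=TRUE)
wk$WEEK[which.max(wk$incidence)]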

References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

Deleting data?

These are not the smallest objects:
#https://gist.github.com/jedifran/1447362
.ls.objects <- function (pos = 1, pattern, order.by,
    decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.prettysize <- napply(names, function(x) {
            capture.output(print(object.size(x), units = "auto")) })
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}

# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

lsos()

             Type     Size PrettySize   Rows Columns
r2     data.frame 20879368    19.9 Mb 231660       4
r4     data.frame  4880608     4.7 Mb 135233       8
r3     data.frame  4737864     4.5 Mb 196911       6
r5     data.frame  4140816     3.9 Mb 114800       7
r1     data.frame   965152   942.5 Kb   3861      62
pop1         list   204752     200 Kb      5      NA
pop    data.frame   182312     178 Kb   2592       3
states data.frame    11144    10.9 Kb     51       3
lsos     function     5400     5.3 Kb     NA      NA
years   character      368  368 bytes      5      NA


But how large is 20 Mb these days? I have 4 Gb of RAM, and at the end of making this post 1.5 Gb is in use.
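
One way to see how much memory R itself has allocated at any point is gc() (a sketch, output not shown):
# total Mb currently in use by R (Ncells plus Vcells)
sum(gc()[, 2])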
