Monday, April 21, 2014

High incidence in Measles Data in Project Tycho

In this third post on the measles data I want to have a look at some occasions of high incidence. As described before, the data are from Project Tycho, which contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.

Data

Data reading follows the post Looking at Measles Data in Project Tycho, part II. In the plot there, some values over 10 appeared, which corresponds to more than 10 cases per 1000 inhabitants in a single week.
r6 <- r5[complete.cases(r5),]  # keep only rows without missing values
r6[r6$incidence>10,]           # state-weeks with incidence over 10 per 1000

      YEAR abb WEEK Cases   State pop incidence
49841 1939  MT   19  5860 Montana 555  10.55856
51076 1939  WY   17  3338 Wyoming 248  13.45968
51090 1939  WY   18  2509 Wyoming 248  10.11694

Indeed, in 1939 three values are over 10. I have always understood you can only catch measles once, so this suggests that a number of years with hardly any measles must have preceded these weeks.

Time

To have a decent plot I need a decent time variable.

Quick and dirty

My quick and dirty approach was to add a small fraction for weeks:
# year plus the fractional position of the week within that year
r6$time <- r6$YEAR+
        (r6$WEEK-.5)/with(r6,ave(WEEK,YEAR,State,FUN=max)+1)

Create a date using formatting

After reading the post Date formating in R I tried a different approach. According to the manual:
%U
Week of the year as decimal number (00–53) using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.

Unfortunately for me that did not work out for reading data:
> as.Date('02 1929',format='%U %Y')
[1] "1929-04-20"

The 20th of April is the day I am running the code, not a date in the second week of 1929. It seems %U is ignored on input in the R I compiled.
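
As an aside, a rough workaround (my own approximation, not used below) is to translate the week number into a day of the year and parse that with %j, which does take part in input conversion:
# approximate: week n is taken to start at day (n-1)*7+1, counted from January 1
wk2date <- function(year,week)
    as.Date(sprintf('%d-%03d',year,(week-1)*7+1),format='%Y-%j')
wk2date(1929,2)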

Reconstructing a date

I thought to start at the correct date and just add 7 for each week:
uu <- unique(data.frame(YEAR=r5$YEAR,WEEK=r5$WEEK))
uu <- uu[order(uu$YEAR,uu$WEEK),]
uu$Date <- as.Date('01-01-1928','%d-%m-%Y') +
    seq(0,by=7,length.out=nrow(uu)) + 3
r7 <- merge(r6,uu)
tail(uu)

       YEAR WEEK       Date
112196 1970   47 1970-11-25
112197 1970   48 1970-12-02
112177 1970   49 1970-12-09
112183 1970   50 1970-12-16
112176 1970   51 1970-12-23
112191 1970   52 1970-12-30

Note that I cannot fully confirm that the dates are correct. The second day of 1963 formats to week 0, which does not match my data; the year is correct though.
format(as.Date('1963-01-02'),format='%d %b %Y week: %U')
[1] "02 Jan 1963 week: 00"

Plot

The plot is at this point easy.
library(ggplot2)
png('wm.png')
ggplot(r7[r7$State %in% c('Wyoming','Montana') &
                r7$YEAR<1945 &
                r7$YEAR>1930 &
                r7$incidence >0,],
        aes(Date, incidence,group=State,colour=State)) +
    ylab('Incidence registered Measles Cases per week per 1000') +
    theme(text=element_text(family='Arial')) +
    geom_line() +
    scale_y_log10()
dev.off()

Indeed, the years before 1939 have a lower incidence of measles. What surprised me is that the first years after 1939 also have a lower incidence.

References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

Extra

Completely unrelated, but if you live in Amsterdam, the photo exhibition "Do you see me?" in Cafe Restaurant Nel might be worth a visit.


Sunday, April 13, 2014

Following open courseware

I love massive open online courses such as those provided on Coursera and edX, so I enrolled in the Data Analysis for Genomics course on edX. I am not alone there, as can be seen from this posting on FreshBiostats.

I was shocked when I took the Pre-Course R self-assessment, imagining it would be easy: click through some answers and done. But then I read questions like "Use match() and indexing..." and "How would you use split() to...". If I ever used match() and split(), it must have been ages ago. I had to use the help just to find out what they did.
So I am wondering how many other basic R functions I have forgotten. I remember searching for quite some time because I had forgotten ave(); that must have happened three or four times. Same for Reduce().
I am almost certain I have reinvented the wheel once or twice, not knowing the function was in a package sitting right on my computer. So even though it is boring, I find it a good thing to get back to basics, but it would be nice to run these lectures at 1.5 speed.
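As a small refresher for myself, one-liners for the functions mentioned above (my own examples, not taken from the course):
match(c('b','d'),letters)              # positions 2 and 4
split(1:6,c('a','b','a','b','a','b'))  # list with a=1,3,5 and b=2,4,6
ave(c(1,2,3,4),c('x','x','y','y'))     # group means, repeated to the input length
Reduce(`+`,1:5)                        # 15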
Then the course suggests RStudio, whereas my preference is Eclipse with StatET. It is probably good to be out of my comfort zone, but I do not expect my taste to change.
Finally, programs come in a markup language (.Rmd files). I am curious to learn whether they manage to add value with that. Let week 2 begin.

Sunday, April 6, 2014

Looking at Measles Data in Project Tycho, part II

Continuing from last week, I will now look at incidence rates of measles in the US. To recap, Project Tycho contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.

Data

Tycho data

This follows the same pattern as last week.
r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
    na.strings='-',
    skip=2)
r2 <- reshape(r1,
    varying=names(r1)[-c(1,2)],
    v.names='Cases',
    idvar=c('YEAR' , 'WEEK'),
    times=names(r1)[-c(1,2)],
    timevar='STATE',
    direction='long')
r2$STATE=factor(r2$STATE)

Population

The absolute counts are of less interest; what matters is cases per 1000 or 100 000 inhabitants, hence population data are needed. Luckily these can be found at census.gov. Less luckily, they sit in one text file per decade, and the decades have slightly different layouts. On top of that, rather than printing all columns next to each other, each file has two sections. Based on my investigation last week, the years 1920 to 1970 are retrieved. Please note that the units are thousands of inhabitants.
# the census population files, one per decade
years <- dir(pattern='\\.txt$')
pop1 <-
    lapply(years,function(x) {
            rl <- readLines(x)
            print(x)
            # each file prints the states in two sections; the header with
            # years sits two lines above the 'U.S.' total, the states run AL to WY
            sp <- grep('^U.S.',rl)
            st1 <- grep('^AL',rl)
            st2 <- grep('^WY',rl)
            rl1 <- rl[c(sp[1]-2,st1[1]:st2[1])]
            rl2 <- rl[c(sp[2]-2,st1[2]:st2[2])]

            # parse one section: label the header, drop thousands separators
            # and remove the census column
            read1 <- function(rlx) {
                rlx[1] <- paste('abb',rlx[1])
                rlx <- gsub(',','',rlx,fixed=TRUE)
                rt <- read.table(textConnection(rlx),header=TRUE)
                rt[,grep('census',names(rt),invert=TRUE)]
            }
            # merge the two sections and reshape the year columns to long format
            rr <- merge(read1(rl1),read1(rl2))
            ll <- reshape(rr,
                list(names(rr)[-1]),
                v.names='pop',
                timevar='YEAR',
                idvar='abb',
                times=as.integer(gsub('X','',names(rr)[-1])),
                direction='long')
        })

[1] "st2029ts.txt"
[1] "st3039ts.txt"
[1] "st4049ts.txt"
[1] "st5060ts.txt"
[1] "st6070ts.txt"

pop <- do.call(rbind,pop1)

Glue between datasets

It is not shown in the printout above, but the census data contains states as two-character abbreviations, while the Tycho data has states as uppercase text with spaces replaced by dots, because they started out as variable names. Some glue is needed. This can be done via the state data in the datasets package. DC is added manually, since it was not in the state data but is present in both source data sets. Incidence is Cases/pop, with units of cases per 1000 inhabitants per week. Variable STATE has served its purpose, so it is removed.
states <- rbind(
    data.frame(
        abb=datasets::state.abb,
        State=datasets::state.name),
    data.frame(abb='DC',
        State='District of Columbia'))
states$STATE=gsub(' ','.',toupper(states$State))

r3 <- merge(r2,states)
r4 <- merge(r3,pop)
r4$incidence <- r4$Cases/r4$pop
r5 <- subset(r4,r4$YEAR>1927,-STATE)
head(r5)

      YEAR abb WEEK Cases   State  pop   incidence
20434 1928  AL   44     6 Alabama 2640 0.002272727
20435 1928  AL   27    45 Alabama 2640 0.017045455
20436 1928  AL   47    22 Alabama 2640 0.008333333
20437 1928  AL   26   106 Alabama 2640 0.040151515
20438 1928  AL   30    66 Alabama 2640 0.025000000
20439 1928  AL   18   251 Alabama 2640 0.095075758

Those who think R has a memory problem may want to delete some data here (r2, r3); see the end of this post. On the other hand, the objects are not that large.
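
For completeness, a minimal sketch of such a clean-up (I did not bother here):
rm(r2,r3)        # drop the intermediate data frames
invisible(gc())  # trigger garbage collection so the memory can be reused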

Plots

Data are now on comparable scales, so I tried boxplots. By 1966 it seems measles was under control.

library(ggplot2)
ggplot(with(r5,aggregate(incidence,list(Year=YEAR,State=State),
                function(x) mean(x,na.rm=TRUE))),
        aes(factor(Year), x)) +
    geom_boxplot(notch = TRUE) +
    ylab('Incidence registered Measles Cases per week per 1000') +
    theme(text=element_text(family='Arial')) +
    scale_x_discrete(labels=
            ifelse(levels(factor(r5$YEAR)) %in%
                    seq(1920,1970,by=10),
                levels(factor(r5$YEAR)),
                ''))

The pattern within the years is still clearly visible: a slow increase from week 38 to week 20 of the next year, then a fast decrease in the remaining 18 weeks.

ggplot(r5[r5$YEAR<1963,],
        aes(factor(WEEK), incidence)) +
    ylab('Incidence registered Measles Cases per week per 1000') +
    ggtitle('Measles 1928-1962')+
    theme(text=element_text(family='Arial')) +
    geom_boxplot(notch = TRUE) +
    scale_x_discrete(labels=
            ifelse(levels(factor(r5$WEEK)) %in%
                    seq(5,55,by=5),
                levels(factor(r5$WEEK)),
                ''))    +
    scale_y_log10()

References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

Deleting data?

These are not the smallest objects:
#https://gist.github.com/jedifran/1447362
.ls.objects <- function (pos = 1, pattern, order.by,
    decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.prettysize <- napply(names, function(x) {
            capture.output(print(object.size(x), units = "auto")) })
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}

# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

lsos()

             Type     Size PrettySize   Rows Columns
r2     data.frame 20879368    19.9 Mb 231660       4
r4     data.frame  4880608     4.7 Mb 135233       8
r3     data.frame  4737864     4.5 Mb 196911       6
r5     data.frame  4140816     3.9 Mb 114800       7
r1     data.frame   965152   942.5 Kb   3861      62
pop1         list   204752     200 Kb      5      NA
pop    data.frame   182312     178 Kb   2592       3
states data.frame    11144    10.9 Kb     51       3
lsos     function     5400     5.3 Kb     NA      NA
years   character      368  368 bytes      5      NA


But how large is 20 Mb these days? I have 4 Gb of RAM. At the end of making this post, 1.5 Gb is in use.

Sunday, March 30, 2014

Looking at Measles Data in Project Tycho

Project Tycho includes data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested. I wanted to play around with the data a bit, so I registered.

Measles

Measles is among the level 2 data. These are standardized data for immediate use and include a large number of diseases, locations, and years. These data are not complete, because standardization is ongoing. The data are retrieved as counts of measles cases, and while I know I should convert to cases per 100 000, I have not done so here.
The data come in wide format, so the first step is conversion to long format. The Northern Mariana Islands variable was read in as logical (all values missing), so I removed it. In addition, data from 1927 and earlier seemed very sparse, so those are removed too.
r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
    na.strings='-',
    skip=2)
r1 <- subset(r1,,-NORTHERN.MARIANA.ISLANDS)
r2 <- reshape(r1,
    varying=names(r1)[-c(1,2)],
    v.names='Cases',
    idvar=c('YEAR' , 'WEEK'),
    times=names(r1)[-c(1,2)],
    timevar='State',
    direction='long')
r2$State=factor(r2$State)
r3 <- r2[r2$YEAR>1927,]

Plotting

The first plot is total cases by year. It shows the drop in cases after the vaccine was introduced (licensed vaccines to prevent the disease became available in 1963; an improved measles vaccine became available in 1968).
library(ggplot2)
qplot(x=Year,
        y=x,
        data=with(r3,aggregate(Cases,list(Year=YEAR),
                function(x) sum(x,na.rm=TRUE))),
        ylab='Sum registered Measles Cases')+
    theme(text=element_text(family='Arial'))


Occurrence within a year by week

Winter and spring seem to be the periods in which most cases occur. The curve seems quite smooth, with a few small fluctuations. The transition between week 52 and week 1 is a bit steep, which may be because I removed week 53 (it is only present in some of the years).

qplot(x=Week,
        y=x,
        data=with(r3[r3$WEEK!=53 & r3$YEAR<1963,],
            aggregate(Cases,list(Week=WEEK),
                function(x) sum(x,na.rm=TRUE))),
        ylab='Sum Registered Measles Cases',
        main='Measles 1928-1962')+
    theme(text=element_text(family='Arial'))

A more detailed look

Trying to understand why the week plot was not smooth, I made that plot with year facets. This revealed an interesting number of zeros, which are an artefact of the processing method (remember, sum(c(NA,NA),na.rm=TRUE) equals 0). I do not know whether the data distinguish between 0 and '-'. There are 872 occurrences of 0, which suggests 0 is used. On the other hand, weeks 6 and 9 of 1980 in Arkansas each have one case while the other weeks from 1 to 22 are '-', which suggests 0 is not used. My feeling is that registration became lax in that period, after measles was under control, and that getting reliable data from the underlying documentation is a laborious task.
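
As an illustration, a minimal sketch (my own check on the long-format data in r2; the exact counts depend on the subset used) of how these numbers can be inspected:
sum(r2$Cases==0,na.rm=TRUE)                            # explicit zeros in the data
subset(r2,State=='ARKANSAS' & YEAR==1980 & WEEK<=22)   # Arkansas, early 1980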

References


Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

Sunday, March 23, 2014

Distribution Kinetics in JAGS

Chapter 19 of Rowland and Tozer (Clinical Pharmacokinetics and Pharmacodynamics, 4th edition) covers distribution kinetics. I am examining problem 3. It is fairly easy, especially since essentially all formulas are on the same page under 'key relationships'. The only difficulty is that the formulas are symmetric in λ1, c1 and λ2, c2, so the two pairs get exchanged between chains. To prevent this I forced λ1 > λ2. In addition, I do not really believe concentrations in the 4000 range are measured as accurately as those in the 5 range (in the period from half an hour to one hour the decrease is about 80 units per minute), so the measurement error is made proportional to the concentration.

Data and model

C19SP3 <- data.frame(
    time=c(0.5,1,1.5,2,3,4,5,8,12,16,24,36,48),
    conc=c(4211,1793,808,405,168,122,101,88,67,51,30,13,6))

library(R2jags)
datain <- list(
    time=C19SP3$time,
    lconc=log(C19SP3$conc),
    n=nrow(C19SP3),
    dose=30*1000)

model1 <- function() {
  # residual error on the log scale, i.e. proportional to concentration
  tau <- 1/pow(sigma,2)
  sigma ~ dunif(0,100)
  # priors for the two exponential phases (unordered)
  llambda1 ~ dlnorm(.4,.1)
  cc1 ~ dlnorm(1,.01)
  llambda2 ~ dlnorm(.4,.1)
  cc2 ~ dlnorm(1,.01)
  # force lambda1 > lambda2 so the chains cannot swap the two phases
  choice <- llambda1>llambda2
  c1 <- choice*cc1+(1-choice)*cc2
  c2 <- choice*cc2+(1-choice)*cc1
  lambda1 <- choice*llambda1+(1-choice)*llambda2
  lambda2 <- choice*llambda2+(1-choice)*llambda1

  for (i in 1:n) {
    pred[i] <- log(c1*exp(-lambda1*time[i])+c2*exp(-lambda2*time[i]))
    lconc[i] ~ dnorm(pred[i],tau)
  }
  # derived pharmacokinetic parameters
  V1 <- dose/(c1+c2)
  AUC <- c1/lambda1+c2/lambda2
  CL <- dose/AUC
  V <- CL/lambda2
  Vss <- dose*(c1/pow(lambda1,2)+c2/pow(lambda2,2))/pow(AUC,2)
}

parameters <- c('c1','c2','lambda1','lambda2' ,
   'V1','CL','Vss','AUC','V')
inits <- function() 
  list(
      sigma=rnorm(1,1,.1),
      cc1=9000,
      cc2=150)

jagsfit <- jags(datain, model=model1, 
    inits=inits, 
    parameters=parameters,progress.bar="gui",
    n.chains=4,n.iter=14000,n.burnin=5000,n.thin=2)

Results

The results are the same as in the book:
Inference for Bugs model at "C:\...5a.txt", fit using jags,
 4 chains, each with 14000 iterations (first 5000 discarded), n.thin = 2
 n.sims = 18000 iterations saved
          mu.vect sd.vect     2.5%      25%      50%       75%     97.5%  Rhat n.eff
AUC      7752.778 130.145 7498.102 7670.443 7751.138  7831.068  8016.879 1.002  3000
CL          3.871   0.065    3.742    3.831    3.870     3.911     4.001 1.002  3000
V          57.971   1.210   55.650   57.215   57.956    58.708    60.401 1.002  4000
V1          2.980   0.104    2.776    2.915    2.978     3.044     3.192 1.002  2800
Vss        18.038   0.600   16.865   17.662   18.029    18.404    19.251 1.002  3900
c1       9933.578 352.138 9253.229 9709.652 9927.037 10145.611 10659.610 1.002  2800
c2        147.197   2.412  142.333  145.734  147.207   148.659   152.069 1.001  5000
lambda1     1.790   0.028    1.734    1.772    1.790     1.807     1.847 1.002  2800
lambda2     0.067   0.001    0.065    0.066    0.067     0.067     0.068 1.001 10000
deviance  -59.366   4.394  -65.150  -62.608  -60.275   -57.058   -48.524 1.001  4100

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 9.6 and DIC = -49.7
DIC is an estimate of expected predictive error (lower deviance is better).

Plot

The plot has narrow intervals, which reflects the narrow intervals of the parameters.
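
The plotting code is not shown here; below is a minimal sketch of how such a plot could be made from the jagsfit object (my own reconstruction, not the original code).
sims <- jagsfit$BUGSoutput$sims.matrix      # posterior draws, one column per parameter
tt <- seq(0.5,48,length.out=200)
# concentration curve for each draw, then 2.5/50/97.5% quantiles per time point
pred <- apply(sims,1,function(s)
        s['c1']*exp(-s['lambda1']*tt)+s['c2']*exp(-s['lambda2']*tt))
qs <- apply(pred,1,quantile,probs=c(.025,.5,.975))
plot(conc~time,data=C19SP3,log='y',
    xlab='Time (hr)',ylab='Concentration')
lines(tt,qs[2,])        # posterior median
lines(tt,qs[1,],lty=2)  # lower 95% bound
lines(tt,qs[3,],lty=2)  # upper 95% bound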

Previous posts in this series:

Simple Pharmacokinetics with Jags