Sunday, April 13, 2014

Following open courseware

I love massive open online courses such as those provided on Coursera and edX. So I enrolled in the Data Analysis for Genomics course on edX. I am not alone there, as can be seen from this posting on FreshBiostats.

I was shocked when I took the Pre-Course R self-assessment. I had imagined this would be easy: click through some answers and done. But then I read questions like "Use match() and indexing..." and "How would you use split() to...". If I ever used match() and split(), it must have been ages ago. I had to use the help just to find out what they did.
So I am wondering how many other basic R functions I have forgotten. I remember searching for quite some time because I had forgotten ave(); that must have happened 3 or 4 times. Same for Reduce().
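As a small refresher for myself, this is roughly what these functions do (just an illustration, not course material):
match(c('b','d','a'), letters)   # positions of the values in letters: 2 4 1
split(mtcars$mpg, mtcars$cyl)    # mpg values split into a list by number of cylinders
ave(mtcars$mpg, mtcars$cyl)      # group means of mpg, repeated to the full length
Reduce(`+`, 1:5)                 # successively combine elements: ((((1+2)+3)+4)+5) = 15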
I am almost certain I have reinvented the wheel once or twice, not knowing the function was in a package sitting right on my computer. So even though it is a bit boring, I find it a good thing to get back to basics again, though it would be nice to run these lectures at 1.5 speed.
Then the course suggests RStudio, whereas my preference is Eclipse with StatET. It is probably good to be out of my comfort zone, but I don't expect my taste to change.
Finally, the programs come as markup (.Rmd files). I am curious to learn whether they manage to get added value out of that. Let week 2 begin.

Sunday, April 6, 2014

Looking at Measles Data in Project Tycho, part II

Continuing from last week, I will now look at incidence rates of measles in the US. To recap, Project Tycho contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.

Data

Tycho data

This follows the same pattern as last week.
r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
    na.strings='-',
    skip=2)
r2 <- reshape(r1,
    varying=names(r1)[-c(1,2)],
    v.names='Cases',
    idvar=c('YEAR' , 'WEEK'),
    times=names(r1)[-c(1,2)],
    timevar='STATE',
    direction='long')
r2$STATE=factor(r2$STATE)

Population

The absolute counts are of less interest; what is relevant is cases per 1000 or 100 000 inhabitants, hence population data are needed. Luckily these can be found at census.gov. Less luckily, they sit in one text file per decade, and the decades have slightly different layouts. On top of that, rather than printing all columns next to each other, each file splits the states over two sections. Based on my investigation of last week, the years 1920 to 1970 are retrieved. Note that the units are thousands of inhabitants.
years <- dir(pattern='+.txt')
pop1 <-
    lapply(years,function(x) {
            rl <- readLines(x)
            print(x)
            sp <- grep('^U.S.',rl)
            st1 <- grep('^AL',rl)
            st2 <- grep('^WY',rl)
            rl1 <- rl[c(sp[1]-2,st1[1]:st2[1])]
            rl2 <- rl[c(sp[2]-2,st1[2]:st2[2])]
           
            read1 <- function(rlx) {
                rlx[1] <- paste('abb',rlx[1])
                rlx <- gsub(',','',rlx,fixed=TRUE)
                rt <- read.table(textConnection(rlx),header=TRUE)
                rt[,grep('census',names(rt),invert=TRUE)]
            }
            rr <- merge(read1(rl1),read1(rl2))
            ll <- reshape(rr,
                list(names(rr)[-1]),
                v.names='pop',
                timevar='YEAR',
                idvar='abb',
                times=as.integer(gsub('X','',names(rr)[-1])),
                direction='long')
        })

[1] "st2029ts.txt"
[1] "st3039ts.txt"
[1] "st4049ts.txt"
[1] "st5060ts.txt"
[1] "st6070ts.txt"

pop <- do.call(rbind,pop1)

Glue between datasets

It is not shown in the printout above, but the census data contain states as two-character abbreviations, while the Tycho data have states as uppercase text with spaces replaced by dots, because they started out as variable names. Some glue is needed. This can be done via the state data in the datasets package. DC is added manually, since it is not in the state data but is present in both source data sets. Incidence is Cases/pop and has as units cases per 1000 inhabitants per week. The variable STATE has served its purpose, so it is removed.
states <- rbind(
    data.frame(
        abb=datasets::state.abb,
        State=datasets::state.name),
    data.frame(abb='DC',
        State='District of Columbia'))
states$STATE=gsub(' ','.',toupper(states$State))

r3 <- merge(r2,states)
r4 <- merge(r3,pop)
r4$incidence <- r4$Cases/r4$pop
r5 <- subset(r4,r4$YEAR>1927,-STATE)
head(r5)

      YEAR abb WEEK Cases   State  pop   incidence
20434 1928  AL   44     6 Alabama 2640 0.002272727
20435 1928  AL   27    45 Alabama 2640 0.017045455
20436 1928  AL   47    22 Alabama 2640 0.008333333
20437 1928  AL   26   106 Alabama 2640 0.040151515
20438 1928  AL   30    66 Alabama 2640 0.025000000
20439 1928  AL   18   251 Alabama 2640 0.095075758

Those who think R has a memory problem may want to delete some data here (r2, r3); see the end of this post. On the other hand, the objects are not that large.
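For those who do want to clean up, it could look like this (a minimal sketch using the object names above; not run here, so the objects still show up in the listing at the end of the post):
# remove intermediate data frames that are no longer needed
rm(r2, r3)
# ask R to return the freed memory
gc()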

Plots

Data are now on comparable scales, so I tried boxplots. By 1966 it seems measles was under control.

library(ggplot2)
ggplot(with(r5,aggregate(incidence,list(Year=YEAR,State=State),
                function(x) mean(x,na.rm=TRUE))),
        aes(factor(Year), x)) +
    geom_boxplot(notch = TRUE) +
    ylab('Incidence registered Measles Cases per week per 1000') +
    theme(text=element_text(family='Arial')) +
    scale_x_discrete(labels=
            ifelse(levels(factor(r5$YEAR)) %in%
                    seq(1920,1970,by=10),
                levels(factor(r5$YEAR)),
                ''))

The pattern within the years is still clearly visible: a slow increase from week 38 to week 20 of the next year, then a fast decrease in the remaining 18 weeks.

ggplot(r5[r5$YEAR<1963,],
        aes(factor(WEEK), incidence)) +
    ylab('Incidence registered Measles Cases per week per 1000') +
    ggtitle('Measles 1928-1962')+
    theme(text=element_text(family='Arial')) +
    geom_boxplot(notch = TRUE) +
    scale_x_discrete(labels=
            ifelse(levels(factor(r5$WEEK)) %in%
                    seq(5,55,by=5),
                levels(factor(r5$WEEK)),
                ''))    +
    scale_y_log10()

References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

Deleting data?

These are not the smallest objects:
#https://gist.github.com/jedifran/1447362
.ls.objects <- function (pos = 1, pattern, order.by,
    decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.prettysize <- napply(names, function(x) {
            capture.output(print(object.size(x), units = "auto")) })
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}

# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

lsos()

             Type     Size PrettySize   Rows Columns
r2     data.frame 20879368    19.9 Mb 231660       4
r4     data.frame  4880608     4.7 Mb 135233       8
r3     data.frame  4737864     4.5 Mb 196911       6
r5     data.frame  4140816     3.9 Mb 114800       7
r1     data.frame   965152   942.5 Kb   3861      62
pop1         list   204752     200 Kb      5      NA
pop    data.frame   182312     178 Kb   2592       3
states data.frame    11144    10.9 Kb     51       3
lsos     function     5400     5.3 Kb     NA      NA
years   character      368  368 bytes      5      NA


But how large is 20 Mb these days? I have 4 Gb of RAM; at the end of making this post, 1.5 Gb of it is in use.
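For comparison, the combined size of the R objects themselves can be computed along these lines (a rough sketch):
# total size in Mb of all objects in the workspace
sum(sapply(ls(), function(x) object.size(get(x)))) / 2^20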

Sunday, March 30, 2014

Looking at Measles Data in Project Tycho

Project Tycho includes data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested. I wanted to play around with the data a bit, so I registered.

Measles

Measles is in the level 2 data. These are standardized data for immediate use and cover a large number of diseases, locations, and years. They are not complete, because standardization is ongoing. The data are retrieved as measles case counts, and while I know I should convert to cases per 100 000, I have not done so here.
The data come in wide format, so the first step is conversion to long format. The Northern Mariana Islands variable was read in as logical, so I removed it. In addition, data from before 1927 seemed very sparse, so those are removed too.
r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
    na.strings='-',
    skip=2)
r1 <- subset(r1,,-NORTHERN.MARIANA.ISLANDS)
r2 <- reshape(r1,
    varying=names(r1)[-c(1,2)],
    v.names='Cases',
    idvar=c('YEAR' , 'WEEK'),
    times=names(r1)[-c(1,2)],
    timevar='State',
    direction='long')
r2$State=factor(r2$State)
r3 <- r2[r2$YEAR>1927,]

Plotting

The first plot is total cases by year. It shows the drop in cases after the introduction of the vaccine (licensed vaccines to prevent the disease became available in 1963; an improved measles vaccine became available in 1968).
library(ggplot2)
qplot(x=Year,
        y=x,
        data=with(r3,aggregate(Cases,list(Year=YEAR),function(x) sum(x,na.rm=TRUE))),
        ylab='Sum registered Measles Cases')+
    theme(text=element_text(family='Arial'))


Occurrence within a year by week

Winter and spring seem to be the periods in which most cases occur. The curve is quite smooth, with a few small fluctuations. The transition between week 52 and week 1 is a bit steep, which may be because I removed week 53 (only present in some of the years).

qplot(x=Week,
        y=x,
        data=with(r3[r3$WEEK!=53 & r3$YEAR<1963,],
            aggregate(Cases,list(Week=WEEK),
                function(x) sum(x,na.rm=TRUE))),
        ylab='Sum Registered Measles Cases',
        main='Measles 1928-1962')+
    theme(text=element_text(family='Arial'))

A more detailed look

Trying to understand why the week plot was not smooth, I made that plot with year facets. This revealed an interesting number of zeros, which are an artefact of the processing method (remember, sum(c(NA,NA),na.rm=TRUE) is 0). I do not know whether the data distinguish between 0 and '-'. There are 872 occurrences of 0, which suggests 0 is used. On the other hand, weeks 6 and 9 of 1980 in Arkansas each have one case while the other weeks from 1 to 22 are '-', which suggests 0 is not used. My feeling is that for that period registration became lax once measles was under control, and getting reliable data from the underlying documentation is a laborious task.
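Counting can be done along these lines (a sketch; r3 as defined above):
# explicit zeros versus '-' entries (read in as NA) in the case counts
sum(r3$Cases == 0, na.rm = TRUE)
sum(is.na(r3$Cases))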

References


Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

Sunday, March 23, 2014

Distribution Kinetics in JAGS

Chapter 19 of Rowland and Tozer (Clinical pharmacokinetics and pharmacodynamics, 4th edition) covers distribution kinetics. I am examining problem 3. It is fairly easy, especially since essentially all formulas are on the same page under 'key relationships'. The only difficulty is that the formulas are symmetrical in (λ1, c1) and (λ2, c2), so these pairs get exchanged between chains. For this reason I forced λ1 > λ2. In addition, I do not really believe concentrations in the 4000 range are measured as accurately as those in the 5 range (in the period from half an hour to one hour, the decrease is about 80 units per minute). The measurement error is therefore proportional to the concentration, which is achieved by modelling the log concentrations.

Data and model

C19SP3 <- data.frame(
    time=c(0.5,1,1.5,2,3,4,5,8,12,16,24,36,48),
    conc=c(4211,1793,808,405,168,122,101,88,67,51,30,13,6))

library(R2jags)
datain <- list(
    time=C19SP3$time,
    lconc=log(C19SP3$conc),
    n=nrow(C19SP3),
    dose=30*1000)

model1 <- function() {
  # residual error on the log scale, i.e. proportional error on the original scale
  tau <- 1/pow(sigma,2)
  sigma ~ dunif(0,100)
  llambda1 ~ dlnorm(.4,.1)
  cc1 ~ dlnorm(1,.01)
  llambda2 ~ dlnorm(.4,.1)
  cc2 ~ dlnorm(1,.01)
  # order the two exponential phases so that lambda1 > lambda2,
  # which prevents the chains from swapping the two phases
  choice <- llambda1>llambda2
  c1 <- choice*cc1+(1-choice)*cc2
  c2 <- choice*cc2+(1-choice)*cc1
  lambda1 <- choice*llambda1+(1-choice)*llambda2
  lambda2 <- choice*llambda2+(1-choice)*llambda1
  
  for (i in 1:n) {
    # biexponential disposition, fitted to the log concentrations
    pred[i] <- log(c1*exp(-lambda1*time[i])+c2*exp(-lambda2*time[i]))
    lconc[i] ~ dnorm(pred[i],tau)
  }
  # derived pharmacokinetic parameters (the 'key relationships' of chapter 19)
  V1 <- dose/(c1+c2)
  AUC <- c1/lambda1+c2/lambda2
  CL <- dose/AUC
  V <- CL/lambda2
  Vss <- dose*(c1/pow(lambda1,2)+c2/pow(lambda2,2))/pow(AUC,2)
}

parameters <- c('c1','c2','lambda1','lambda2' ,
   'V1','CL','Vss','AUC','V')
inits <- function() 
  list(
      sigma=rnorm(1,1,.1),
      cc1=9000,
      cc2=150)

jagsfit <- jags(datain, model=model1, 
    inits=inits, 
    parameters=parameters,progress.bar="gui",
    n.chains=4,n.iter=14000,n.burnin=5000,n.thin=2)

Results

The results are the same as in the book:
Inference for Bugs model at "C:\...5a.txt", fit using jags,
 4 chains, each with 14000 iterations (first 5000 discarded), n.thin = 2
 n.sims = 18000 iterations saved
          mu.vect sd.vect     2.5%      25%      50%       75%     97.5%  Rhat n.eff
AUC      7752.778 130.145 7498.102 7670.443 7751.138  7831.068  8016.879 1.002  3000
CL          3.871   0.065    3.742    3.831    3.870     3.911     4.001 1.002  3000
V          57.971   1.210   55.650   57.215   57.956    58.708    60.401 1.002  4000
V1          2.980   0.104    2.776    2.915    2.978     3.044     3.192 1.002  2800
Vss        18.038   0.600   16.865   17.662   18.029    18.404    19.251 1.002  3900
c1       9933.578 352.138 9253.229 9709.652 9927.037 10145.611 10659.610 1.002  2800
c2        147.197   2.412  142.333  145.734  147.207   148.659   152.069 1.001  5000
lambda1     1.790   0.028    1.734    1.772    1.790     1.807     1.847 1.002  2800
lambda2     0.067   0.001    0.065    0.066    0.067     0.067     0.068 1.001 10000
deviance  -59.366   4.394  -65.150  -62.608  -60.275   -57.058   -48.524 1.001  4100

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 9.6 and DIC = -49.7
DIC is an estimate of expected predictive error (lower deviance is better).

Plot

The plot has narrow intervals, which reflects the small uncertainty in the parameters.
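The plotting code itself is not shown here; a minimal sketch of how the curve and its interval could be constructed from the posterior samples (using the jagsfit and C19SP3 objects above) is:
# predicted concentration curve with 95% interval from the posterior samples
sims <- jagsfit$BUGSoutput$sims.list
tgrid <- seq(0.5, 48, length.out = 200)
bands <- sapply(tgrid, function(t)
    quantile(sims$c1 * exp(-sims$lambda1 * t) + sims$c2 * exp(-sims$lambda2 * t),
        probs = c(0.025, 0.5, 0.975)))
plot(conc ~ time, data = C19SP3, log = 'y',
    xlab = 'Time (hr)', ylab = 'Concentration')
matlines(tgrid, t(bands), lty = c(2, 1, 2), col = 'black')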

Previous posts in this series:

Simple Pharmacokinetics with Jags

Sunday, March 16, 2014

PK calculations for infusion at constant rate

In this third PK posting I move to chapter 10, study problem 4, of Rowland and Tozer (Clinical pharmacokinetics and pharmacodynamics, 4th edition). In this problem one subject receives a continuous infusion for 24 hours. In many respects the JAGS calculation is not very different from before, but the relation between the model parameters and the PK parameters is a bit different.

Data

The data are from an infusion at constant rate (1.5 µm/hr, but the calculations come out at the expected values with 1.5 µg/hr, and no molecular weight is given in the question) for 24 hours. Drug concentration (µg/L) in plasma is measured for 28 hours.
ch10sp4 <- data.frame(
    time=c(0,1,2,4,6,7.5,10.5,14,18,24,25,26,27,28),
    conc=c(0,4.2,14.5,21,23,19.8,22,20,18,21,18,11.6,7.1,4.1))

Model

The model is straightforward, but it does suffer a bit from sensitivity to initial values. The construction with (time[i]<24) and (time[i]>=24) is just a quick way to ensure that the rising and declining parts of the model sit at the correct places in time.
library(R2jags)
datain <- as.list(ch10sp4)
datain$n <- nrow(ch10sp4)
datain$Rinf <- 1.5*1000
model1 <- function() {
  for (i in 1:n) {
    pred[i] <- Css*((1-exp(-k*time[i]))*(time[i]<24)
                  +exp(-k*(time[i]-24))*(time[i]>=24))
    conc[i] ~ dnorm(pred[i],tau)
  }
  tau <- 1/pow(sigma,2)
  sigma ~ dunif(0,100)
  Css <- Rinf/CL
  k <- CL/V
  V ~ dlnorm(2,.01)
  CL ~ dlnorm(1,.01)
  t0_5 <- log(2)/k
  Cssalt <- 3000/CL
  h1 <- Cssalt*(1-exp(-k))
  h2 <- Cssalt*(1-exp(-2*k))
  CLpermin <- CL/60
}

parameters <- c('k','CL','V','t0_5','Cssalt','h1','h2','Css','CLpermin')
inits <- function() 
  list(V=rlnorm(1,log(100),.1),CL=rlnorm(1,log(70),.1),sigma=rlnorm(1,1,.1))
jagsfit <- jags(datain, model=model1, 
    inits=inits, 
    parameters=parameters,progress.bar="gui",
    n.chains=4,n.iter=14000,n.burnin=5000,n.thin=2)
jagsfit
Inference for Bugs model at "C:/Users/.../modelf7824253314.txt", fit using jags,
 4 chains, each with 14000 iterations (first 5000 discarded), n.thin = 2
 n.sims = 18000 iterations saved
         mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
CL        68.021   3.252  62.069  65.902  67.845  69.932  75.065 1.001  6600
CLpermin   1.134   0.054   1.034   1.098   1.131   1.166   1.251 1.001  6600
Css       22.102   1.046  19.983  21.449  22.109  22.761  24.167 1.001  6600
Cssalt    44.204   2.092  39.965  42.899  44.218  45.522  48.333 1.001  6600
V        170.798  23.922 127.164 155.118 169.417 184.662 222.523 1.001 18000
h1        14.679   1.698  11.572  13.575  14.586  15.685  18.367 1.001 18000
h2        24.419   2.307  20.014  22.913  24.360  25.849  29.249 1.001 18000
k          0.406   0.058   0.306   0.368   0.401   0.438   0.536 1.001 18000
t0_5       1.742   0.243   1.293   1.583   1.730   1.886   2.267 1.001 18000
deviance  66.267   3.024  62.809  64.041  65.477  67.692  74.131 1.002  2000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 4.6 and DIC = 70.8
DIC is an estimate of expected predictive error (lower deviance is better).
Expected values: CL=1.2 L/min, t1/2=1.6 hr, V=164 L. The values are a bit off; the text gives Css as 21 mg/L while the model finds 22.
Three additional parameters, Cssalt, h1 and h2, correspond to part (c) of the question: what are the concentrations if the dosage is doubled? Expected answers: Cssalt=42, h1=14.8, h2=24.

Plot of the model

Last week I made a similar plot in classical R graphics, so this week I am using ggplot2. What I notice from a modelling perspective is the narrowness of the band near time=0, which misses the point at 1 h. The whole pattern of points there almost suggests that a small delay followed by a steeper increase (a lower t1/2) would make for a better fit. Interestingly, the same can be seen in the decreasing part of the curve. From a data perspective, the wide variability around the plateau concentration may be important. Perhaps that particular variation is the cause of the steepness I mentioned before.
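The plotting code is not shown in the post; a minimal ggplot2 sketch along the lines of the model above (using the jagsfit and ch10sp4 objects from this post) could be:
library(ggplot2)
# posterior median and 95% interval of the predicted curve
sims <- jagsfit$BUGSoutput$sims.list
tgrid <- seq(0, 28, length.out = 200)
bands <- t(sapply(tgrid, function(t)
    quantile(sims$Css * ((1 - exp(-sims$k * t)) * (t < 24) +
                exp(-sims$k * (t - 24)) * (t >= 24)),
        probs = c(0.025, 0.5, 0.975))))
preddata <- data.frame(time = tgrid,
    lower = bands[, 1], median = bands[, 2], upper = bands[, 3])
ggplot(preddata, aes(x = time)) +
    geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
    geom_line(aes(y = median)) +
    geom_point(data = ch10sp4, aes(y = conc)) +
    xlab('Time (hr)') + ylab('Concentration (µg/L)')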

Part (b): is the curve equally steep in the rising and declining parts?

This requires a different model, since in the current model they are by definition equally shaped. The parameter k12, the difference between k1 and k2, is the parameter of interest.
datain2 <- as.list(ch10sp4)
datain2$n <- nrow(ch10sp4)
model2 <- function() {
  for (i in 1:n) {
    pred[i] <- Css*((1-exp(-k1*time[i]))*(time[i]<24)
          +exp(-k2*(time[i]-24))*(time[i]>=24))
    conc[i] ~ dnorm(pred[i],tau)
  }
  tau <- 1/pow(sigma,2)
  sigma ~ dunif(0,100)
  k1 ~ dlnorm(0,0.001)
  k2 ~ dlnorm(0,0.001)
  Css ~ dnorm(21,.01)
  k12 <- k1-k2
}

parameters2 <- c('k1','k2','k12','Css')
jagsfit2 <- jags(datain2, model=model2,  
    parameters=parameters2,progress.bar="gui",
    n.chains=4,n.iter=14000,n.burnin=5000,n.thin=2)
Inference for Bugs model at "C:/Users/...modelf783e02e64.txt", fit using jags,
 4 chains, each with 14000 iterations (first 5000 discarded), n.thin = 2
 n.sims = 18000 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
Css       21.275   1.131 19.089 20.559 21.258 21.982 23.578 1.001  6800
k1         0.529   0.138  0.333  0.446  0.512  0.588  0.814 1.002  8300
k12        0.194   0.163 -0.074  0.099  0.182  0.272  0.520 1.015  3800
k2         0.334   0.068  0.215  0.291  0.329  0.374  0.480 1.001  8600
deviance  64.761   3.663 60.215 62.073 63.929 66.528 73.979 1.002  3900

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 6.7 and DIC = 71.5

DIC is an estimate of expected predictive error (lower deviance is better).
It seems the difference is just not large enough to be convincing, which is what was supposed to be found. Css has now decreased compared to the first model, closer to the value that was given.
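A quick numerical summary of the same conclusion (a sketch using the jagsfit2 object above) is the posterior probability that the rising part is steeper than the declining part:
# proportion of posterior samples with k1 > k2
mean(jagsfit2$BUGSoutput$sims.list$k12 > 0)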

Previous posts in this series:

Simple Pharmacokinetics with Jags