Sunday, November 17, 2013

Dutch Rainwater Composition 1992-2005.

After reading Blog About Stats' Open Data Index blog post I decided to browse a bit in the Open Data Index. Choosing Netherlands and following Emission of Pollutants, I ended up on a page from the National Institute for Public Health (RIVM). The page http://www.lml.rivm.nl/data/gevalideerd/index.html is in Dutch; from it I chose the composition of rain water 1992-2005. There are also files for 2006-2008, 2009, 2010 and 2011, but in 2006 the machines were changed ('a report describing the changes is in preparation'), so I am only using 1992-2005 in this post.

Data

The data is in a spreadsheet with one page per location. There is quite a big header on each page, and each data column has an associated column marking data below the detection limit. This being real-world data, there are also the odd missing values.
Following Read Excel Data from R I chose to read the data via XLConnect. The whole process is wrapped in a function, so I can read all pages (all locations).
library(XLConnect)
wb = loadWorkbook("lmre_1992-2005.xls")
sheets <- getSheets(wb)[-1]
readsheet <- function(cursheet) {
  # read one sheet (location) and return it in long format
  df = readWorksheet(wb, sheet = cursheet , header = TRUE)
  topline <- 8                              # measurements start at this row
  test <- which(df[6,]=='C')+1              # data columns, each preceded by a 'C' status column
  numin <- df[topline:nrow(df),test]
  names(numin) <- make.names( gsub(' ','',df[6,test]))
  # decimal comma to decimal point, then convert to numeric
  for(i in 1:ncol(numin)) numin[,i] <- as.numeric(gsub(',','.',numin[,i]))
  status <- df[topline:nrow(df),test-1]     # below-detection-limit markers
  names(status) <- paste('C',make.names( gsub(' ','',df[6,test])),sep='')
  numin$StartDate <- as.Date(substr(df[topline:nrow(df),1],1,10))
  numin$EndDate <-  as.Date(substr(df[topline:nrow(df),2],1,10))
  numin <- cbind(numin,status)
  # reshape wide to long: one row per sampling period and compound
  tall <- reshape(numin,
      varying=list(make.names( gsub(' ','',df[6,test])),
         paste('C',make.names( gsub(' ','',df[6,test])),sep='')),
      v.names=c('Amount','Status'),
      timevar='Compound',
      idvar=c('StartDate','EndDate'),
      times=paste(df[6,test],'(',df[5,test],')',sep=''),
      direction='long')
  tall$Compound <- factor(gsub(' )',')',gsub(' +',' ',tall$Compound)))
  row.names(tall)<- NULL
  tall$Location <- cursheet
  tall
}
tt <- lapply(sheets,readsheet)
tt2 <- do.call(rbind,tt) 
tt2$Location <- factor(tt2$Location)
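A quick look at the combined data frame is an easy check that the reading and reshaping did what I intended; for example (output not shown):
str(tt2)
head(tt2)
summary(tt2$Amount)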

Plots

There are certainly many variables to be analyzed, more than I want to cover in a blog post. I have chosen pH (acid rain) and iron (Fe).
levels(tt2$Compound)
 [1] "As (umol/l)"     "Ca (umol/l)"     "Cd (umol/l)"     
 [4] "Cl (umol/l)"     "Co (umol/l)"     "Cr (umol/l)"    
 [7] "Cu (umol/l)"     "F (umol/l)"      "Fe (umol/l)"     
[10] "K (umol/l)"      "K25 (uS/cm)"     "massa_hc (g)"   
[13] "massa_zm (g)"    "Mg (umol/l)"     "Mn (umol/l)"   
[16] "Na (umol/l)"     "NH4 (umol/l)"    "Ni (umol/l)"     
[19] "NO3 (umol/l)"    "nsl (mm)"        "Pb (umol/l)"     
[22] "pH (pH-eenheid)" "PO4 (umol/l)"    "SO4 (umol/l)"
[25] "st.zr. (umol/l)" "V (umol/l)"      "Zn (umol/l)" 

pH   

Acid rain was quite a scare last century. We should be able to see progress, especially in the nineties, when there was quite some activity around it. One location is not shown: Leiduin has data complementary to 'De Zilk'. Removing this location gave a much nicer layout for the remaining data.
The plots show progress, but at some locations (e.g. De Zilk, Rotterdam) there is still a way to go. Is Vredepeel becoming more acidic again? I may need to add more data after all.
library(ggplot2)
library(scales)
(cc <- levels(tt2$Compound)[22])
pp <- ggplot(tt2[tt2$Compound==cc & tt2$Location!='Leiduin',], 
    aes( StartDate,Amount))
pp + geom_point() +
    facet_wrap(~Location) +
    ylab(cc) + xlab('Year') +
    scale_x_date(breaks='4 years',labels = date_format("%Y"))
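To judge by eye whether a location such as Vredepeel is really turning more acidic, a smoother per panel helps. A minimal variation on the plot above (same data selection, loess smoother added):
pp + geom_point() +
    geom_smooth(method='loess') +
    facet_wrap(~Location) +
    ylab(cc) + xlab('Year') +
    scale_x_date(breaks='4 years',labels = date_format("%Y"))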

Iron

Iron is plotted on a logarithmic y scale. The disadvantage is that this hides both the extreme highs (Kollumerwaard) and the values at or below zero. However, on this scale the data seem more or less 'normal'.
(cc <- levels(tt2$Compound)[9])
pp <- ggplot(tt2[tt2$Compound==cc & tt2$Location!='Leiduin',], aes( StartDate,Amount))
pp + geom_point() +
    scale_y_log10() +
    facet_wrap(~Location) +
    ylab(cc) + xlab('Year') +
    scale_x_date(breaks='4 years',labels = date_format("%Y"))
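As a quick check on how much the log scale hides, the non-positive iron values can be counted per location before trusting the picture; a small sketch:
# number of iron measurements at or below zero, per location
with(tt2[tt2$Compound==cc & !is.na(tt2$Amount),],
    table(Location[Amount<=0]))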

Mixed model for Iron

Just to get some quantification of the reduction I ran a model. The obvious choice is again to use logarithmically transformed data. However, some zeros unleashed havoc on the subsequent estimates, hence these values are set to NA prior to the computations.
library(lme4) 
Fe <- tt2[tt2$Compound==levels(tt2$Compound)[9],]
# plot(density(Fe$Amount[!is.na(Fe$Status) & Fe$Status!=0],adjust=.5))
Fe$LAmount <- log10(Fe$Amount)
Fe$LAmount[Fe$Amount<=0] <- NA
ll <- lmer(LAmount ~  StartDate + (1|Location) +(StartDate|Location ) ,
    data=Fe)
summary(ll)
Linear mixed model fit by REML ['lmerMod']
Formula: LAmount ~ StartDate + (1 | Location) + 
        (StartDate | Location) 
   Data: Fe 

REML criterion at convergence: 2170.498 

Random effects:
 Groups     Name        Variance  Std.Dev.  Corr
 Location   (Intercept) 4.727e-11 6.875e-06     
 Location.1 (Intercept) 1.456e-04 1.207e-02     
            StartDate   1.348e-10 1.161e-05 1.00
 Residual               1.436e-01 3.789e-01     
Number of obs: 2338, groups: Location, 17

Fixed effects:
              Estimate Std. Error t value
(Intercept)  5.001e-01  5.971e-02   8.376
StartDate   -6.975e-05  6.338e-06 -11.005

Correlation of Fixed Effects:
          (Intr)
StartDate -0.862
A parameter of interest is the effect of StartDate. It is the size of the effect per day; over a year this translates into a factor of about 0.94, i.e. close to a 6% reduction per year. The t-value suggests this is significant, as does the associated ANOVA.
> (10^-6.975e-5)^365
[1] 0.9430642
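The same number can also be taken directly from the fitted object rather than retyping the estimate; for example:
# yearly multiplicative change implied by the per-day slope (log10 scale)
10^(365*fixef(ll)['StartDate'])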
ll2 <- lmer(LAmount ~  1 + (1|Location) +(StartDate|Location ) ,
    data=Fe)
anova(ll,ll2)
Data: Fe
Models:
ll2: LAmount ~ 1 + (1 | Location) + (StartDate | Location)
ll: LAmount ~ StartDate + (1 | Location) + (StartDate | Location)
    Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)   
ll2  6 2196.0 2230.5 -1092.0   2184.0                            
ll   7 2157.3 2197.6 -1071.7   2143.3 40.656      1  1.815e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A bootstrap simulation of this significance needs some more processing. It did not work unless I removed all rows with missing values prior to the calculation. The result was 0 exceedances out of 1000 simulations, which in hindsight is obvious: even if I don't fully trust the ANOVA p-value, I don't believe it is a factor 10^7 off either.
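For completeness, a minimal sketch of such a parametric bootstrap (not the exact code I ran; it assumes complete cases in a hypothetical Fe2 and ML fits so the likelihoods are comparable):
Fe2 <- Fe[!is.na(Fe$LAmount),]
llm  <- lmer(LAmount ~ StartDate + (1|Location) + (StartDate|Location), data=Fe2, REML=FALSE)
llm2 <- lmer(LAmount ~ 1 + (1|Location) + (StartDate|Location), data=Fe2, REML=FALSE)
obsLRT <- as.numeric(2*(logLik(llm)-logLik(llm2)))
simLRT <- replicate(1000, {
  y <- simulate(llm2)[,1]      # simulate a response under the null model
  as.numeric(2*(logLik(refit(llm,y))-logLik(refit(llm2,y))))
})
mean(simLRT >= obsLRT)         # bootstrap p-value; here 0 out of 1000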
