Data
The data is in a spreadsheet, one page per location. There is quite a big header on each page and each data column has an associated column marking data below detection limit. This being real world data there are also the odd missing data.Following Read Excel Data from R I chose to read the data via XLConnect. The whole process is wrapped in a function, So I can read all pages (all locations).
library(XLConnect)
wb = loadWorkbook("lmre_1992-2005.xls")
sheets <- getSheets(wb)[-1]
readsheet <- function(cursheet) {
df = readWorksheet(wb, sheet = cursheet , header = TRUE)
topline <- 8
test <- which(df[6,]=='C')+1
numin <- df[topline:nrow(df),test]
names(numin) <- make.names( gsub(' ','',df[6,test]))
for(i in 1:ncol(numin)) numin[,i] <- as.numeric(gsub(',','.',numin[,i]))
status = df[topline:nrow(df),test-1]
names(status) <- paste('C',make.names( gsub(' ','',df[6,test])),sep='')
numin$StartDate <- as.Date(substr(df[topline:nrow(df),1],1,10))
numin$EndDate <- as.Date(substr(df[topline:nrow(df),2],1,10))
numin <- cbind(numin,status)
tall <- reshape(numin,
varying=list(make.names( gsub(' ','',df[6,test])),
paste('C',make.names( gsub(' ','',df[6,test])),sep='')),
v.names=c('Amount','Status'),
timevar='Compound',
idvar=c('StartDate','EndDate'),
times=paste(df[6,test],'(',df[5,test],')',sep=''),
direction='long')
tall$Compound <- factor(gsub(' )',')',gsub(' +',' ',tall$Compound)))
row.names(tall)<- NULL
tall$Location <- cursheet
tall
}
tt <- lapply(sheets,readsheet)
tt2 <- do.call(rbind,tt)
tt2$Location <- factor(tt2$Location)
Plots
There are certainly many variables to be analyzed, more than I would do in a blog post. I have chosen pH (acid rain) and Iron (Fe).
levels(tt2$Compound)
[1] "As (umol/l)" "Ca (umol/l)" "Cd (umol/l)"
[4] "Cl (umol/l)" "Co (umol/l)" "Cr (umol/l)"
[7] "Cu (umol/l)" "F (umol/l)" "Fe (umol/l)"
[10] "K (umol/l)" "K25 (uS/cm)" "massa_hc (g)"
[4] "Cl (umol/l)" "Co (umol/l)" "Cr (umol/l)"
[7] "Cu (umol/l)" "F (umol/l)" "Fe (umol/l)"
[10] "K (umol/l)" "K25 (uS/cm)" "massa_hc (g)"
[13] "massa_zm (g)" "Mg (umol/l)" "Mn (umol/l)"
[16] "Na (umol/l)" "NH4 (umol/l)" "Ni (umol/l)"
[19] "NO3 (umol/l)" "nsl (mm)" "Pb (umol/l)"
[22] "pH (pH-eenheid)" "PO4 (umol/l)" "SO4 (umol/l)"
[16] "Na (umol/l)" "NH4 (umol/l)" "Ni (umol/l)"
[19] "NO3 (umol/l)" "nsl (mm)" "Pb (umol/l)"
[22] "pH (pH-eenheid)" "PO4 (umol/l)" "SO4 (umol/l)"
[25] "st.zr. (umol/l)" "V (umol/l)" "Zn (umol/l)"
pH
Acid rain was quite a scare last century. We should be able to see progress, especially in the nineties, when there was quite some activity on it. One location is not shown; Leiduin has data complimentary to 'De Zilk'. Removal of this location gave a much nicer layout for the remaining data.
The plots show progress, but some locations (e.g. De Zilk, Rotterdam) there is still a way to go. Is Vredepeel going more acid again? I may need to add more data after all.
library(ggplot2)
library(scales)
(cc <- levels(tt2$Compound)[22])
pp <- ggplot(tt2[tt2$Compound==cc & tt2$Location!='Leiduin',],
aes( StartDate,Amount))
pp + geom_point() +
facet_wrap(~Location) +
ylab(cc) + xlab('Year') +
scale_x_date(breaks='4 years',labels = date_format("%Y"))
Iron
Iron is plotted on a logarithmic y scale. Disadvantage is that this hides extreme high (Kollumerwaard) and below zero's. However, on this scale the data seem more or less 'normal'.
(cc <- levels(tt2$Compound)[9])
pp <- ggplot(tt2[tt2$Compound==cc & tt2$Location!='Leiduin',], aes( StartDate,Amount))
pp + geom_point() +
scale_y_log10() +
facet_wrap(~Location) +
ylab(cc) + xlab('Year') +
scale_x_date(breaks='4 years',labels = date_format("%Y"))
Mixed model for Iron
Just to get some quantification on the reduction I ran a model. The obvious choice is again to use logarithmic transformed data. However, some zeros unleashed havoc on the subsequent estimates. Hence these data are made NA prior to computations.library(lme4)
Fe <- tt2[tt2$Compound==levels(tt2$Compound)[9],]
# plot(density(Fe$Amount[!is.na(Fe$Status) & Fe$Status!=0],adjust=.5))
Fe$LAmount <- log10(Fe$Amount)
Fe$LAmount[Fe$Amount<=0] <- NA
ll <- lmer(LAmount ~ StartDate + (1|Location) +(StartDate|Location ) ,
data=Fe)
summary(ll)
Linear mixed model fit by REML ['lmerMod']
Formula: LAmount ~ StartDate + (1 | Location) +
(StartDate | Location)
Data: Fe
REML criterion at convergence: 2170.498
Random effects:
Groups Name Variance Std.Dev. Corr
Location (Intercept) 4.727e-11 6.875e-06
Location.1 (Intercept) 1.456e-04 1.207e-02
StartDate 1.348e-10 1.161e-05 1.00
Residual 1.436e-01 3.789e-01
Number of obs: 2338, groups: Location, 17
Fixed effects:
Estimate Std. Error t value
(Intercept) 5.001e-01 5.971e-02 8.376
StartDate -6.975e-05 6.338e-06 -11.005
Correlation of Fixed Effects:
(Intr)
StartDate -0.862
A parameter of interest is effect of StartDate. It is the size of the effect per day. This translates in 5% reduction per year. The t-value suggests this is significant, as does the associated ANOVA.
> (10^-6.975e-5)^365
[1] 0.9430642
ll2 <- lmer(LAmount ~ 1 + (1|Location) +(StartDate|Location ) ,
data=Fe)
anova(ll,ll2)
Data: Fe
Models:
ll2: LAmount ~ 1 + (1 | Location) + (StartDate | Location)
ll: LAmount ~ StartDate + (1 | Location) + (StartDate | Location)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
ll2 6 2196.0 2230.5 -1092.0 2184.0
ll 7 2157.3 2197.6 -1071.7 2143.3 40.656 1 1.815e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A bootstrap simulation of this significance needs some more processing. It did not work unless I removed all data with missing values prior to calculation. The result was 0 out of 1000 simulations, which in hindsight is obvious. Even if I don't trust the ANOVA, I don't believe it would be a factor 10^7 off either.
No comments:
Post a Comment