The data
As described before; the data are from KNMI; ROYAL NETHERLANDS METEOROLOGICAL INSTITUTE. The actual page for the data is http://www.knmi.nl/klimatologie/monv/reeksen/select_rr.html. The data selected were from 6 locations; De Bilt, Groningen, Heerde, Kerkwerve, Ter Apel and Winterswijk (Sibinkweg). The blog starts with data entered in a data.frame named all which was derived last week.
The problems with the data
The data is certainly not normal. Most days have no rain. The data is also location correlated. It is improbable to have beautiful weather in one location and loads of rain 200 km, possibly with the exception of summer, when thunderstorms may be highly local. It is also not often to have nice weather one day and loads of rain the next. For those not living here, examine this plot of 1906 data.
library(ggplot2)
Y1906 <- all[all$year==1906,]
c1 <- ggplot(Y1906, aes(x=day, y=RD,col=location,linetype=location))
c1 + geom_line() +
facet_wrap( ~ Month) +
theme(legend.position = "bottom") +
labs(col = " ",linetype=' ')
Analysis
There are many ways to analyze the data for differences. In this case I wanted to compare the first and last 10 years of data. A non-parametric method is preferred because the non-normality That leaves all the other problems, correlations between observations. To avoid these I tried to eliminate data. Just one location avoids correlation between locations. Data every sixth day avoids time correlation. De Bilt is chosen as location because it is centre of the country and home of the KNMI. Finally, January, because only one month seems most opportune. It seems quite crude when you look at it.
sel1 <- all[(all$year <1916 | all$year>2003)
& all$Month=='Jan'
& all$location=='DE-BILT'
& all$day %in% seq(1,31,by=5),]
sel1$decade <- factor(c('1906-1915','2004-2012')[1+(sel1$year>1950)])
Packige coin has a one_way function which seems suitable. Distribution approximate makes it do an actual Monte Carlo resampling, and the result is no difference.
oneway_test(RD ~ decade,data=sel1,distribution='approximate')
Approximative 2-Sample Permutation Test
data: RD by decade (1906-1915, 2004-2013)
Z = -0.6353, p-value = 0.535
alternative hypothesis: true mu is not equal to 0
Oneway_test tests for location,Kolmogorov_Smirnov tests for general differences. The assumption is that the data is continuous, not so sure that holds. Result; p-value close to 0.05.
ks.test(sel1$RD[sel1$decade=='1906-1915'],
sel1$RD[sel1$decade=='2004-2013'])
Two-sample Kolmogorov-Smirnov test
data: sel1$RD[sel1$decade == "1906-1915"] and sel1$RD[sel1$decade == "2004-2013"]
D = 0.2286, p-value = 0.05161
alternative hypothesis: two-sided
The question is why there are differences? After plotting, it seems the number of days without rain changes. The latter seems to show a significant difference.
c1 <- ggplot(sel1, aes(x=RD/10,col=decade))
c1 + geom_density() + xlab('mm rain/day')
xtabs(~ decade + factor(RD==0),data=sel1)
factor(RD == 0)
decade FALSE TRUE
1906-1915 37 33
2004-2013 53 17
decade FALSE TRUE
1906-1915 37 33
2004-2013 53 17
prop.test( xtabs(~ decade + factor(RD==0) ,data=sel1))
2-sample test for equality of proportions with continuity correction
data: xtabs(~decade + factor(RD == 0), data = sel1)
X-squared = 7, df = 1, p-value = 0.008151
alternative hypothesis: two.sided
95 percent confidence interval:
-0.3970179 -0.0601250
sample estimates:
prop 1 prop 2
0.5285714 0.7571429
Check with December data
Since January was used to select the analysis method, ideally it should not be used for the p-value. There is data enough; December turns out to show a p-value of 0.005.
sel2 <- all[(all$year <1916 | all$year>2002)
& all$Month=='Dec'
& all$location=='DE-BILT'
& all$day %in% seq(1,31,by=5),]
sel2$decade <- factor(c('1906-1915','2002-2012')[1+(sel2$year>1950)])
prop.test( xtabs(~ decade + factor(RD==0) ,data=sel2))
2-sample test for equality of proportions with continuity correction
data: xtabs(~decade + factor(RD == 0), data = sel2)
X-squared = 7.7712, df = 1, p-value = 0.005309
alternative hypothesis: two.sided
95 percent confidence interval:
-0.36463362 -0.06393781
sample estimates:
prop 1 prop 2
0.6571429 0.8714286
No comments:
Post a Comment