Sunday, November 29, 2015

Wind in Netherlands II

Two weeks ago I plotted how wind measurements on the edge of the North Sea changed in the past century. This week the same dataset is used for hypothesis testing.

Data

The most important things to reiterate from previous post is that the data is from KNMI and they come with a comment: "These time series are inhomogeneous because of station relocations and changes in observation techniques. As a result, these series are not suitable for trend analysis. For climate change studies we refer to the homogenized series of monthly temperatures of De Bilt or the Central Netherlands Temperature"
Data reading has slighlty changed, mostly because I needed different variables. In addition, for testing I wanted some categorical variables, these are Month and year. For year I have chosen five chunks of 22 years, 22 was chosen since it seemed large enough and resulted in approximately equal size chunks. Finally, for display purposes, wind direction was categorized in 8 directions according to the compass rose (North, North-East, East etc.).
library(circular)
library(dplyr)
library(ggplot2)
library(WRS2)

r1 <- readLines('etmgeg_235.txt')
r2 <- r1[grep('^#',r1):length(r1)]
explain <- r1[1:(grep('^#',r1)-1)]
explain
r2 <- gsub('#','',r2)
r3 <- read.csv(text=r2)

r4 <- mutate(r3,
        Date=as.Date(format(YYYYMMDD),format='%Y%m%d'),
        year=floor(YYYYMMDD/1e4),
        rDDVEC=as.circular(DDVEC,units='degrees',template='geographics'),
        # Vector mean wind direction in degrees 
        # (360=north, 90=east, 180=south, 270=west, 0=calm/variable)
        DDVECf=as.character(cut(DDVEC,breaks=c(0,seq(15,330,45),361),left=TRUE,
            labels=c('N','NE','E','SE','S','SW','W','NW','N2'))),
        DDVECf=ifelse(DDVECf=='N2','N',DDVECf),
        DDVECf=factor(DDVECf,levels=c('N','NE','E','SE','S','SW','W','NW')),
        rFHVEC=FHVEC/10, # Vector mean windspeed (in 0.1 m/s)
        yearf=cut(year,seq(1905,2015,22),labels=c('05','27','49','71','93')),
        month=factor(format(Date,'%B'),levels=month.name),
        tcat=interaction(month,yearf)
    ) %>%
    select(.,YYYYMMDD,Date,year,month,DDVEC,rDDVEC,DDVECf,rFHVEC,yearf,tcat)

Analysis

The circular package comes with an aov.circular() function, which can do one way analysis. Since I am a firm believer that direction varies according to the seasons, the presence of a time effect (the five categories) has been examined by Month. To make result compact, only p-values are displayed, they are all significant.
sapply(month.name,function(x) {
      aa <- filter(r4,month==x)
      bb <- aov.circular(aa$rDDVEC,aa$yearf,method='F.test')
      format.pval(bb$p.value,digits=4,eps=1e-5)
    }) %>% as.data.frame
January   4.633e-05
February    < 1e-05
March       < 1e-05
April       < 1e-05
May         < 1e-05
June        0.00121
July       0.000726
August    0.0001453
September   0.02316
October     < 1e-05
November  0.0001511
December   0.003236
The associated plot with this data shows frequency of directions by year and Month. The advantage here being that the time axis is the x-axis, so changes are more easily visible.
ggplot(r4[complete.cases(r4),], aes(x=yearf))+
    geom_histogram()+
    facet_grid(DDVECf ~ month)+
    ggtitle('Frequency of Wind Direction')

The other part of wind is strength. Two weeks ago I saw clear differences. However, since that may also be effect of instrument or location change. The test I am interested here is therefore not the main effect of year categories but rather the interaction Month*Year. In the objective of robustness I wanted to go nonparametric with this. However, since I did not find anything regarding two factor interaction in my second edition of Hollander and Wolfe I googled for robust interaction. This gave a hit on rcompanion for the WRS2 package.
t2way(rFHVEC ~ yearf + month + yearf:month, 
    data = r4)
                value p.value
yearf       1063.0473   0.001
month        767.5687   0.001
yearf:month  169.4807   0.001

Conclusion

The data seems to show a change in wind measurements over these 110 years. This can be due to changes in wind or measurement instrument or instrument location. The statistical testing was chosen such as to counter some effects of these changes, hence it can be thought that the change is due to changes in wind itself.

Sunday, November 15, 2015

Wind in Netherlands

In climate change discussions, everybody talks about temperature. But weather is much more than that. There is at least rain and wind as directly experienced quality, and air pressure as measurable quantity. In the Netherlands, some observation stations have more than a century of daily data on these things. The data may be broken in the sense that equipment and location can have changed. To quote: "These time series are inhomogeneous because of station relocations and changes in observation techniques. As a result, these series are not suitable for trend analysis. For climate change studies we refer to the homogenized series of monthly temperatures of De Bilt link or the Central Netherlands Temperature link." Since I am not looking at temperature but wind, I will keep to this station's data.

Data

Data are from daily observations from KNMI. I have chosen station De Kooy. For those less familiar with Dutch geography, this is close to Den Helder, in the tip North West of Netherlands. This means pretty close to the North Sea, Wadden Sea and Lake IJssel. Wind should be relatively unhindered there. The data themselves are daily observations. For wind there are:
DDVEC     Vector mean wind direction in degrees
                  (360=north, 90=east, 180=south, 270=west, 0=calm/variable)
FHVEC     Vector mean windspeed (in 0.1 m/s)
FG             Daily mean windspeed (in 0.1 m/s)
FHX          Maximum hourly mean windspeed (in 0.1 m/s)
FHXH       Hourly division in which FHX was measured
FHN          Minimum hourly mean windspeed (in 0.1 m/s)
FHNH       Hourly division in which FHN was measured
FXX          Maximum wind gust (in 0.1 m/s)
FXXH       Hourly division in which FXX was measured
The header of the data downloaded contains this, and much more information. I am sure there are good reasons to do speed in 0.1 m/s, but personally I find m/s more easy.
The two first variables are 'vector means'. It is obvious that one cannot simply average directions. Luckily there is the circular package, which does understand direction.
Thus the data reading script becomes:
r1 <- readLines('etmgeg_235.txt')
r2 <- r1[grep('^#',r1):length(r1)]
explain <- r1[1:(grep('^#',r1)-1)]
# explain
r2 <- gsub('#','',r2)
r3 <- read.csv(text=r2)
library(dplyr)
library(circular)
methods(sd)
r4 <- mutate(r3,
    Date=as.Date(format(YYYYMMDD),format='%Y%m%d'),
    year=floor(YYYYMMDD/1e4),
    month=factor(format(Date,'%B'),levels=month.name),
    rDDVEC=as.circular(DDVEC,units='degrees',template='geographics'),
              # Vector mean wind direction in degrees 
              # (360=north, 90=east, 180=south, 270=west, 0=calm/variable)
    rFHVEC=FHVEC/10, # Vector mean windspeed (in 0.1 m/s)
    rFG=FG/10,   # Daily mean windspeed (in 0.1 m/s) 
    rFHX=FHX/10, # Maximum hourly mean windspeed (in 0.1 m/s)
    rFHN=FHN/10, # Minimum hourly mean windspeed (in 0.1 m/s)
    rFXX=FXX/10 # Maximum wind gust (in 0.1 m/s)
    ) %>%
    select(.,YYYYMMDD,Date,year,month,rDDVEC,rFHVEC,
        rFG,rFHX,rFHN,rFXX)

Plots

Plot of mean wind speed shows several effects. There is an equipment change just before year 2000. At the beginning of the curve the values are lowest, while in the sixties there is a bit more wind, as was n the nineties. I wonder about that. Is that equipment? I can imagine that hundred years ago there was lesser equipment giving such a change, but fifty or twenty years ago? Finally, close to the end of the war there is missing data.
ggplot(data=r4,aes(y=rFG,x=Date))+
    geom_smooth()+
    geom_point(alpha=.03) +
    ylab('Mean wind speed x (m/s)')+
    xlab('Year')

A second plot is by month. This shows somewhat different patterns. There is still most wind in the middle of last century. However, September and October have the most wind just before 1950, while November and December have most wind after 1950. Such a pattern cannot be attributed to changes in equipment. It would seem there is some kind of change in wind speeds then.
r5 <- group_by(r4,month,year) %>%
    summarise(.,mFG=mean(rFG),mFHX=max(rFHX),mFXX=max(rFXX))
ggplot(data=r5,aes(y=mFG,x=year)) +
    geom_smooth(method='loess') +
    geom_point(alpha=.5)+
    facet_wrap(~ month)
 

Wind direction

In the Netherlands there is a clear connection between wind and the remainder of the weather. Most of the wind is from the SW (south west, I will be using N, E, S, W to abbreviate directions from here on). N, NW, W and SW winds take humidity from the North Sea and Atlantic Ocean, which in turn will bring rain. In winter, the SW wind will also bring warmth, there will be no frost with W and SW wind. In contrast, N, NE and E will bring cold. A winter wind from Siberia will bring skating fever. In summer, the nice and sunny weather is associated with S to E winds the E wind in May is associated with nice spring weather. SE is by far the least common direction. 
The circular package has a both density and plot functions. Combining these gets the following directions for the oldest part of the data. 
par(mfrow=c(3,4),mar=c(0,0,3,0))
lapply(month.name,function(x) {
      xx <- r4$rDDVEC[r4$year<1921 & r4$month==x]
      xx <- xx[!is.na(xx)]
      density(xx,bw=50)  %>% 
          plot(main=x,xlab='',ylab='',shrink=1.2)
      1
    })
title("1906-1920", line = -1, outer = TRUE)

I would be hard pressed to see significant differences between old and recent data. The densities are slightly different, but not really impressive. Note the lack of E wind in summer, indicating that recent summers have been not been very spectacular. 
par(mfrow=c(3,4),mar=c(0,0,2,0))
lapply(month.name,function(x) {
      xx <- r4$rDDVEC[r4$year>=2000 & r4$month==x]
      xx <- xx[!is.na(xx)]
      density(xx,bw=50)  %>% 
          plot(main=x,xlab='',ylab='',shrink=1.2)
      1
    })
title("2000-now", line = -1, outer = TRUE)

Sunday, November 1, 2015

Vacancies in Europe

I like playing around with data from Eurostat. At this time the tools to do so are just so easy. There are tools to pull the data directly from the data base in R (eurostat package). Process it a bit using dplyr and before you know it, ggplot makes a plot.

Data

My starting point to examine data is the database page. From there I can browse for the correct table and view its contents. Having done that, I can take the name of the table and pull that in R. The name of the vacancy database I chose (Job vacancy statistics - quarterly data (from 2001 onwards), NACE Rev. 2) is jvs_q_nace2, hence with
library(eurostat)
library(dplyr)
library(ggplot2)
library(scales)
r1 <- get_eurostat('jvs_q_nace2')
I have all packages needed and the data in R. One of the properties of the data is that everything is coded. Hence the next step is to merge the codes. The following code pulls the country codes and does a bit of post processing on the names to get them a bit nicer. Subsequently, the variously combinations of countries determined by expanding of the EU and Euro area at various time points are removed. These data have the property that they are too abundant, some data removal is needed. Finally, seasonably adjusted data is selected and all company sizes are used.
# add country names
r2 <- get_eurostat_dic('geo') %>%
    mutate(.,
        geo=V1,
        country=V2,
        country=gsub('\\(.*$','',country),
        country=gsub(' $','',country)) %>%
    select(.,geo,country) %>%
    merge(.,r1) %>%
# filter countries
    filter(.,
        !grepl('EA.*',geo),
        !grepl('EU.*',geo),
        s_adj=='SA'  ,# seas. adj.
        sizeclas!='Total') %>% # all company sizes
    mutate(.,country=factor(country)) %>%
    select(.,-geo,-s_adj,-sizeclas)
For other variables, it is more or less the same. get_eurostat_dic() pulls the coding and they can be merged. The text in nace is a bit long, so I shortened it.
r3 <- get_eurostat_dic('nace_r2') %>%
    rename(.,
        nace_r2=V1, # add NACE 
        nace=V2) %>%
    mutate(.,
        nace=substr(nace,1,110)) %>%
    merge(.,r2) %>%
    mutate(nace=factor(nace))

r4 <- get_eurostat_dic('indic_em') %>%
    rename(.,
        indic_em=V1) %>%
    merge(.,r3) %>%
    mutate(.,
        property=factor(V2)) %>%
    select(.,-V2)

Plots

Since the data is now prepared, the next step is to plot. There are actually far too many categories in nace and a selection to be displayed is needed. If you want know what different categories are, use
nace <- select(r4,nace_r2,nace) %>% unique() 
to display what each category represents. I chose to select a number of industry related categories. In addition some countries have very limited data, they are eliminated.
filter(r4,
        property=='Job vacancy rate',
        nace_r2 %in% c('A-S','B-E','B-S','B-F'),
        !(country %in% 
              c('Croatia', 'Greece','Portugal',# limited years
                  'Switzerland')),             # limited classes
        time>as.Date('01-01-2006',format='%d-%m-%Y'),
        !is.na(values)) %>%
    mutate(.,country=factor(country)
        ,nace_r2=factor(nace_r2)) %>%
    
    ggplot(.,aes(x=time,y=values,color=nace)) +
    geom_line() +
    facet_wrap(  ~ country )+
    ylab('Job vacancy rate')+
    guides(color=guide_legend(ncol=1))+
    scale_x_date(labels=date_format("%y"))+
    xlab('Year')+
    theme(legend.position="bottom", legend.title=element_blank())
In the plot the enormous drops for Cyprus, Czech Republic and Estonia are clearly visible. The Czech Republic is also rebounding quite steeply. UK had a smaller drop in 2008, but is now at pre-crisis job vacancy rates. In fact many countries show increases in job vacancy rate.

Getting a different display is just very easy. Below the call to get number of vacancies in education, information and communication and research. Since the number of vacancies is really dependent on country size, a logarithmic scale is chosen. The countries displayed are slightly different, it appears not all countries have all data. But the trends are similar as the previous plot.

filter(r4,
        property=='Number of job vacancies',
        nace_r2 %in% c('J','M','M_N','P'),
        !(country %in% # limited years
              c('Croatia', 'Greece','Portugal','Sweden')),
        !is.na(values)) %>%
    mutate(.,country=factor(country)
        ,nace_r2=factor(nace_r2)) %>%
    
    ggplot(.,aes(x=time,y=values,color=nace )) +
    geom_line() +
    facet_wrap(  ~ country )+
    ylab('Number of job vacancies')+
    guides(color=guide_legend(ncol=1))+
    scale_x_date(labels=date_format("%y"))+
    xlab('Year')+
    scale_y_log10()+
    theme(legend.position="bottom", legend.title=element_blank())