Sunday, September 21, 2014

Trying dplyr on triathon data

There was a triathlon in Almere last week, like every year since 1983. I pulled the data of all years to get some idea how things have changed in that sport. To get a visual I decided to plot the best 10% athletes. Then later I decided this was an ideal moment to look at plyr and dplyr again, so rewrote everything using those tools. I must say I like them very much and intend to use them again.

Data

Data from triathlon-uitslagen.nl, this years website www.challenge-almere.com and wikipedia which told me there was one lady in 1983. The data is in a bunch of spreadsheets (.xls and .xlsx), with different columns used, detail level and even those merged cells which make things look nice. Some years had besides finishers also those who did not, broken down by time of certain milestones. Some years had teams next to individuals. It was a bit too much variation to be coded in R, I did my first cleaning, removing spurious columns and rows, in libreoffice and made relatively uniform .xls files.
During importing R coded most times as date times some as character. The date times had a date (1899-12-31), so I made time to difftimes and retained the original as character. The actual year of the match is pulled from the spreadsheet name.
library(XLConnect)
library(plyr)
library(dplyr)
library(lattice)
setwd('C:\\Users\\Kees\\Documents\\r hobby\\almeretriatlon')
dd <- dir(,'.xls')

readyear <- function(filename) {
  print(filename)
  wb <- loadWorkbook(filename)
  rw <- readWorksheet(wb,sheet=1)
  names(rw) <- tolower(names(rw))
  names(rw)[names(rw)=='total'] <- 'time'
  mutate(rw,
      timec=if (class(rw$time)[1]=='character') {
            time
          } else format(time,'%H:%M:%S') ,
      time=as.difftime(timec),
      year=as.numeric(gsub('[A-Za-z\\. ]','',filename))
  )
}
years0 <- lapply(dd,readyear)
The resulting data may have a gender variable, and/or, a code variable. The code variable may be character and contain gender plus age bracket as digits (or team, or PRO or ELI), but it can also only contain the age bracket and hence be numerical. Gender can be D, H, M, V, W, interpreted as dame, heer, man, vrouw, woman respectively (female, male, male, female, female).
For plotting purposes a two digit Year factor was added, running from 83 to 14.
triatlon <- ldply(years0,function(x) {
# first unify cat info & gender 
          x <- if ('cat' %in% names(x)) {
                if (class(x$cat)[1]=='character') {
                  mutate(x,
                      gender=substr(cat,1,1),
                      cat=substring(cat,2))
                } else  mutate(x,
                      cat=as.character(cat))
              } else mutate(x,cat=NA)
        }) %>% 
    filter(.,!(gender %in% c('T','P'))) %>%
    mutate(.,
        timen=as.numeric(time) , 
        gender=factor(
            ifelse(gender %in% c('H','M'),
                'Male','Female')) ,
        Year=factor(substr(format(year),3,4),
            levels=substr(format(
                    seq(min(year),max(year))),3,4))) %>%
    select(.,
        gender,
        cat,
        year,
        Year,
        time,
        timec,
        timen)  

Plot 

This is actually where plyr/dplyr worked best to get nice clean code, it was as if it was made for this purpose, define groups, define selection and plot. The choice for lattice plots was made before I went to recode and I did not feel like changing it.
The plot shows times steadily decreasing through the previous century and stabilizing around turn of the century. 2006 was an European championship and 2008 world championship, and these years do dot stick out as being faster. Hence these times should be fairly representative of what can be achieved in the Netherlands. There is quite some year to year variation, but that can be attributed to weather and tracks used.  
group_by(triatlon,year,gender) %>% 
    filter(.,percent_rank(time)<0.1) %>%
    bwplot(timen ~ Year | gender,
        data=.,
        ylab='Time (h)',
        xlab='Year',
        horizon=FALSE,
        layout=c(1,2,1),
        drop.unused.levels=FALSE,
        scales=list(cex=.7))

2 comments:

  1. Hi,
    Nice post.
    How can I download that file from website?
    Many thanks.

    ReplyDelete
    Replies
    1. I am sorry. I don't want to go and host files myself. The data are all on http://www.triathlon-uitslagen.nl/ by now, but need quite a bit of processing before they are actually computer readable.

      Delete