Sunday, January 27, 2013

European Fishing

I am playing around with Eurostat data and ggplot2 a bit more. As I progress, the plotting gets easier, the data pre-processing gets a bit simpler, and the surprises in the data remain.

Eurostat data

The data used are fish_fleet (number of ships) and fish_pr (production = catch + aquaculture). After selecting the years 1992 and later, I decided to pull the data as csv rather than xls, with number formatting '1 234.56'. As a consequence the data now come tall and skinny (one value per row), which may actually be better. However, the number format and the ':' used for missing values still need a bit of processing.
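library(ggplot2)
# stringsAsFactors=TRUE keeps GEO, VESSIZE and SPECIES as factors (not the default since R 4.0)
fleet <- read.csv("fish_fleet_1_Data.csv",colClasses=c(rep(NA,4),'character'),stringsAsFactors=TRUE)
# drop the space used as thousands separator; ':' marks a missing value
fleet$Number <- scan(textConnection(gsub(' ', '',fleet$Value)),na.strings=':')
catch <- read.csv("fish_pr_00_1_Data.csv",colClasses=c(rep(NA,5),'character'),stringsAsFactors=TRUE)
catch$Tonnes <- scan(textConnection(gsub(' ', '',catch$Value)),na.strings=':')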
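As a quick check of the number handling, here is a minimal sketch on a few made-up values in the same format. The vector `value` is hypothetical and not taken from the Eurostat files, and it uses `as.numeric` rather than `scan`, so it is an alternative route to the same result, not the code above.

```r
# Hypothetical values: space as thousands separator, ':' for missing
value <- c('1 234.56', ':', '12', '7 000')

# Turn ':' into NA first, then remove the spaces and convert
cleaned <- gsub(' ', '', replace(value, value == ':', NA))
number <- as.numeric(cleaned)
print(number)
```

Converting ':' to NA before calling `as.numeric` avoids the coercion warning that the raw ':' would otherwise trigger.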
Still need to make the GEO labels a bit shorter
shortlevels <- function(xx) {
  levels(xx) <- gsub('European Economic Area','EEA' ,levels(xx))
  levels(xx) <- gsub(' plus IS, LI, NO','+' ,levels(xx),fixed=TRUE)
  levels(xx) <- sub(' countries)',')' ,levels(xx),fixed=TRUE)
  levels(xx) <- sub(' (under United Nations Security Council Resolution 1244/99)','' ,levels(xx),fixed=TRUE)
  levels(xx) <- sub('European Free Trade Association','EFTA' ,levels(xx),fixed=TRUE)
  levels(xx) <- sub('Former Yugoslav Republic of Macedonia, the','FYROM' ,levels(xx),fixed=TRUE)
  levels(xx) <- sub('including +former GDR','Incl GDR' ,levels(xx))
  levels(xx) <- sub('European Union','EU' ,levels(xx))
  levels(xx)[grep('Germany',levels(xx))] <- 'Germany'
  xx
}
catch$GEO <- shortlevels(catch$GEO)
fleet$GEO <- shortlevels(fleet$GEO)
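To see what this kind of relabelling does, here is a small sketch on three labels. The labels are hypothetical, modelled on the ones handled in the function, and may not match the Eurostat files exactly.

```r
# Hypothetical GEO labels, modelled on the ones the function handles
geo <- factor(c('European Union (27 countries)',
                'Germany (including  former GDR from 1991)',
                'Iceland'))

# fixed=TRUE treats the pattern as a literal string,
# so the parenthesis needs no escaping
levels(geo) <- sub(' countries)', ')', levels(geo), fixed = TRUE)
levels(geo) <- sub('European Union', 'EU', levels(geo), fixed = TRUE)

# Any level mentioning Germany collapses to plain 'Germany'
levels(geo)[grep('Germany', levels(geo))] <- 'Germany'
print(levels(geo))
```

Working on `levels()` rather than on the values themselves keeps the variable a factor and only touches each distinct label once.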

Plot about the fleet

The only preparation needed was to select the Tonnage classes and to keep only countries. The aggregates (EFTA, EEA and EU) all have a number like 15, 25 or 27 in their label, which makes them easy to drop.
f2 <- fleet[grep('Tonnage',as.character(fleet$VESSIZE)) ,]
f2 <- f2[-grep('15|25|27',f2$GEO),]
f2 <- f2[complete.cases(f2),]
f2$VESSIZE <- factor(f2$VESSIZE)
f2$GEO <- factor(f2$GEO)
Order the levels of VESSIZE by value for a nice display
lev <- gsub('(-|\\+).*','',levels(f2$VESSIZE))
nlev <- as.numeric(gsub('^[[:alpha:]]* ','',lev))
f2$VESSIZE <- factor(as.character(f2$VESSIZE),
    levels= levels(f2$VESSIZE)[order(nlev,lev)])
levels(f2$VESSIZE) <-    gsub('Tonnage ','',levels(f2$VESSIZE))
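The idea of ordering factor levels by the number hidden in the label can be sketched on its own. The class labels below are hypothetical; the real Eurostat labels may be worded differently.

```r
# Hypothetical tonnage class labels
size <- factor(c('Tonnage 100-499', 'Tonnage 2000+',
                 'Tonnage 0-24', 'Tonnage 25-99'))

# Extract the lower bound of each class as a number
lower <- as.numeric(sub('^Tonnage ([0-9]+).*$', '\\1', levels(size)))

# Rebuild the factor with the levels sorted by that lower bound
size <- factor(size, levels = levels(size)[order(lower)])
print(levels(size))
```

Without this step the levels sort as text, so a class starting at 100 would come before one starting at 25 in the legend.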
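The first aim is a dotplot of the last year (2010), with countries ordered by size of fleet.
f3 <- f2[f2$TIME==2010 ,]
f4 <- f3[f3$VESSIZE=='Total all Classes',]
# order the countries by their total fleet size
f3$GEO <- factor(as.character(f3$GEO),
    levels=as.character(f4$GEO)[order(f4$Number)])
# plotting all of f3 is an assumption; the ggplot() line did not survive
ggplot(f3,
        aes(y=GEO,x=Number,colour=VESSIZE))  + 
    geom_point()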
It seems Greece had the largest fleet. Any idea I had that the Netherlands was a fishing country has been erased.
For the plot over time I chose to put the number of vessels on a logarithmic scale. As there are rather many countries, only the ones with the biggest fleets have been selected.
mfleet <- aggregate(f4$Number,list(GEO=f4$GEO),max)
bigfleet <- mfleet$GEO[mfleet$x>quantile(mfleet$x,1-9/nrow(mfleet))]
ggplot(f2[f2$GEO %in% bigfleet & f2$VESSIZE!='Total all Classes'  ,],
        aes(x=TIME,y=Number,colour=VESSIZE))  + 
    geom_line() +
    facet_wrap( ~ GEO, drop=TRUE) + 
    scale_y_log10()
The interesting thing about this plot is that the number of vessels is decreasing in every category except the biggest one, more than 2000 tonnage. There are only a few tens of those, but each must count for loads of smaller vessels.
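The cut-off used to pick the biggest fleets, `quantile(x, 1-9/n)`, may look a bit cryptic. A minimal sketch on made-up numbers (the data frame below is hypothetical) shows that it keeps the nine largest values when there are no ties around the cut-off.

```r
# Hypothetical maximum fleet sizes for 20 countries
mfleet <- data.frame(GEO = paste0('C', 1:20),
                     x = (20:1) * 100,
                     stringsAsFactors = FALSE)

# Quantile at probability 1 - 9/n: everything strictly above it is the top nine
cutoff <- quantile(mfleet$x, 1 - 9/nrow(mfleet))
bigfleet <- mfleet$GEO[mfleet$x > cutoff]
print(bigfleet)
```

With tied values near the cut-off the number of selected countries can differ from nine, so ordering the values and taking the first nine is the safer route when an exact count matters.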


The fish caught get much the same treatment. In this case, SPECIES and GEO have far too many levels for a decent display, so only the biggest catches are shown. On top of that, three SPECIES categories are almost the same: 'Total', 'Aquatic animals' and 'Finfish and invertebrates'.
Finfish probably needs an explanation. To quote Wikipedia: "Many types of aquatic animals commonly referred to as 'fish' are not fish in the sense given above; examples include shellfish, cuttlefish, starfish, crayfish and jellyfish. In earlier times, even biologists did not make a distinction – sixteenth century natural historians classified also seals, whales, amphibians, crocodiles, even hippopotamuses, as well as a host of aquatic invertebrates, as fish. However, according to the definition above, all mammals, including cetaceans like whales and dolphins, are not fish. In some contexts, especially in aquaculture, the true fish are referred to as finfish (or fin fish) to distinguish them from these other animals."
c2010 <- catch[catch$TIME==2010,] 
c2010 <- c2010[complete.cases(c2010),]
mcatch <- aggregate(c2010$Tonnes,list(GEO=c2010$GEO),max)
bigcatch <- mcatch$GEO[mcatch$x>quantile(mcatch$x,.5)]
c2010 <- c2010[c2010$GEO %in% bigcatch,]
mcatch <- aggregate(c2010$Tonnes,list(SPECIES=c2010$SPECIES),max)
bigcatch <- mcatch$SPECIES[mcatch$x>quantile(mcatch$x,.5)]
bigcatch <- bigcatch[!(bigcatch %in% 
          c('Aquatic animals','Finfish and invertebrates'))]
c2010 <- c2010[c2010$SPECIES %in% bigcatch,]
c2010$SPECIES <- factor(c2010$SPECIES)
ggplot(c2010,
        aes(y=GEO,x=Tonnes,colour=SPECIES))  + 
    geom_point() +
    labs(colour='Tonnes live weight')
The surprise here is Denmark, which is getting loads of fish. The same is true for the UK and Spain.

Combination of fleet and catch

Since we have both data sets, they can be combined. The merging ids are GEO and TIME, which means the data have to be reshaped from long to wide beforehand. The newly created variables get 'Number.' and 'Tonnes.' as a prefix in their names, which I do not need.
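# keep only the id, class and value columns, then spread the classes over columns;
# timevar and v.names are reconstructed from the column names used further down
tfl <- reshape(fleet[,c('TIME','GEO','VESSIZE','Number')],direction='wide',idvar=c('TIME','GEO'),
    timevar='VESSIZE',v.names='Number')
names(tfl) <- gsub('Number.','',names(tfl),fixed=TRUE)
rca <- reshape(catch[,c('TIME','GEO','SPECIES','Tonnes')],direction='wide',idvar=c('TIME','GEO'),
    timevar='SPECIES',v.names='Tonnes')
names(rca) <- gsub('Tonnes.','',names(rca),fixed=TRUE)
both <- merge(tfl,rca,by=c('TIME','GEO'))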
both2 <- both[-grep('15|25|27',both$GEO),]
ggplot(both2[!(both2$GEO %in% c('Belgium','Bulgaria','Cyprus','Estonia')),],  # rest of the exclusion list is missing
        aes(y=`Total fishery products`,
            x=`Total all Tonnage Classes`,colour=TIME))  + 
    geom_point() +
    facet_wrap( ~ GEO, drop=TRUE)
I very much like how ggplot2 handles TIME as colour variable by default. It shows very nicely how catches and fleets are getting smaller; the latter is obviously not true for the biggest ships, as seen above. It also shows that Denmark and Iceland have remarkably efficient fleets: small, but catching loads of fish. In contrast, Greece has a big fleet but a small catch. That does not seem economical, but tonnes do not equal euros. Regarding the UK and Spain: yes, Spain is just a bit bigger than the UK, so that pain may exist.
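The reshape-then-merge step can be tried on a toy example. All names and numbers below are made up for illustration; only the pattern (long to wide on a class variable, then merge on TIME and GEO) follows the text.

```r
# Hypothetical long fleet data: one row per year, country and size class
fleet_long <- data.frame(TIME = c(2009, 2009, 2010, 2010),
                         GEO = 'Denmark',
                         VESSIZE = c('Small', 'Large', 'Small', 'Large'),
                         Number = c(10, 2, 9, 3),
                         stringsAsFactors = FALSE)

# One row per TIME and GEO, one column per size class
wide <- reshape(fleet_long, direction = 'wide', idvar = c('TIME', 'GEO'),
                timevar = 'VESSIZE', v.names = 'Number')
names(wide) <- sub('Number.', '', names(wide), fixed = TRUE)

# Hypothetical catch totals, already one row per TIME and GEO
catch_wide <- data.frame(TIME = c(2009, 2010), GEO = 'Denmark',
                         Total = c(100, 90), stringsAsFactors = FALSE)

both <- merge(wide, catch_wide, by = c('TIME', 'GEO'))
print(both)
```

Note that the key columns go in `by`; `merge` has no `id` argument, and an unknown argument is silently ignored in favour of all shared column names.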

Catch per species

Finally, I wanted to look per species. However, this would be a bit too long for this blog, so I only show one. It runs in a function, which just takes a bit of string to match against the SPECIES variable. To keep the plot simple, only the six largest countries (by median catch) are taken. facet_wrap does two things here: it puts a title on the plot even if there is only one species, and it makes separate panels if more than one value of SPECIES matches the string.
byspecies <- function(species) {
  ca <- catch[grep(species,catch$SPECIES),c(-4,-5,-6)]
  ca <- ca[complete.cases(ca),]
  ca <- ca[!(ca$GEO %in% c('EFTA','EU (15)','EU (27)')),]
  ag <- aggregate(ca$Tonnes,list(GEO=ca$GEO),median)
  ag <- ag[order(-ag$x),]
  ca <- ca[ca$GEO %in% ag$GEO[1:6],]
  ca$GEO <- factor(ca$GEO)    
  ggplot(ca   ,   aes(y=Tonnes,x=TIME,colour=GEO))  + 
      geom_line() +
      facet_wrap( ~ SPECIES, drop=TRUE)
}


  1. Quite interesting post! The second last plot has number of boats on the x-axis, right? So it shows that Greece and Italy (still) have a large number of small boats. (I wonder if the Icelanders really have bothered to register small boats...) I take your comment on "efficiency" with a grain of salt, as tonnes fish per boat is not a very good measure of efficiency. Those small boats typically have a low operating cost in terms of fuel and man-hours (often being operated by a single person). Also they are only used for part of the year. A better measure would be tonnes fish per tonnes boat, which I guess you could calculate relatively easily? (assuming all boats within a tonnage class to have some fixed weight) Would have been cool to see such a plot. (Even better, of course, would be value of fish caught for each euro spent fishing, including the cost of both man-hours and fuel, but that is harder to find...)

    1. Yes, the second last plot has number of boats on the x-axis.

      "I wonder if the Icelanders really have bothered to register small boats..". I suppose the Atlantic is a bit different from the Mediterranean. There may be smaller boats, but not the small ships just pulled up-shore in the evening.

      "Those small boats typically have a low operating cost in terms of fuel and man-hours (often being operated by a single person)." I think so too. On top of that they would operate in places not suitable for the big boats.

      I imagine the big factory ships get relatively good profits. It is not for nothing their numbers grow.

  2. thanx for raising the awareness of this data site. been digging into it. from my perspective it would have been useful to include in the script the direct download to the data file. say this because the following script shows how to get at least one format of the fleet data:

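    library(R.utils)   # gunzip
    library(stringr)   # str_locate, str_sub
    library(reshape2)  # melt (assumed; the reshape package also provides it)

    URL <- ''
    TABLE <- 'fish_fleet.tsv.gz'
    # download and unpack; these two lines are reconstructed, the originals are missing
    download.file(paste(URL,TABLE,sep=''),TABLE,mode='wb')
    gunzip(TABLE)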
    fleet <- read.table('fish_fleet.tsv',header=TRUE,sep='\t',na.strings=c(':',': z'),strip.white=TRUE)
    names(fleet) <- substr(names(fleet),2,nchar(names(fleet)))
    names(fleet)[1] <- 'X'
    n <- str_locate(fleet$X,',')
    fleet$size <- str_sub(fleet$X,1,n[,1]-1)
    fleet$X <- str_sub(fleet$X,n[,1]+1)
    n <- str_locate(fleet$X,',')
    fleet$measure <- str_sub(fleet$X,1,n[,1]-1)
    fleet$cntry <- str_sub(fleet$X,n[,1]+1)
    fleet <- fleet[,2:ncol(fleet)]
    fleet <- melt(fleet,id.vars=c('size','measure','cntry'))
    names(fleet) <- c('size','measure','cntry','year','n')
    fleet <- fleet[!is.na(fleet$n),]

    otherwise i will refrain from comments with regards to making inference about the data - for that one needs to have some background in both fisheries as well as fisheries science :-)


    1. You are very welcome. I saw a number of posts about open European data being available. Rather than repeating that, I decided to go and play with some. But I did not know about this general incantation to get data, nor about gunzip in R.utils and stringr, so thanks for those. I agree with you, it is most ideal if the script includes the download. Now that I know it is possible, I will use that.