Wiekvoet: Trying to optimize

I wanted to try some more machine learning. On Kaggle there is a competition, How Much Did It Rain? II, with a much bigger data set than Titanic. To quote from Kaggle:
Rainfall is highly variable across space and time, making it notoriously tricky to measure. Rain gauges can be an effective measurement tool for a specific location, but it is impossible to have them everywhere. In order to have widespread coverage, data from weather radars is used to estimate rainfall nationwide. Unfortunately, these predictions never exactly match the measurements taken using rain gauges.

Data

On the data themselves:
To understand the data, you have to realize that there are multiple radar observations over the course of an hour, and only one gauge observation (the 'Expected'). That is why there are multiple rows with the same 'Id'.
I have downloaded the data and at this point am just experimenting with them. It is quite a big data set: there are 9125329 rows in the training set. My idea was to do 'something' per record, then combine the records of one hour to get a prediction. The 'something' is as yet undefined, but the idea of combining by Id will be retained.
What became clear pretty quickly is that everything is slow with this amount of data. Hence for now I will use only 10% of the training data. For ease of access the data are sitting in an R data set.
load('aaa3.RData')
###
# take 10% of data
rawdata <- rawdata[rawdata$Id < quantile(rawdata$Id,.1),]
# extract keys per hour: Id & Expected
# rawdata$Id!=c(0,rawdata$Id[1:(nrow(rawdata)-1)])
# is the R way to write the SAS code: by Id; If first.Id;
r1b <- rawdata[rawdata$Id!=c(0,rawdata$Id[1:(nrow(rawdata)-1)]),c(1,24)]
Id <- factor(rawdata$Id)
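
The first-row-per-Id selection above can be checked on a small example. The sketch below uses a made-up data frame (`toy` and its values are assumptions, not the Kaggle data) and compares the lag comparison with `!duplicated()`, which should mark the same rows as long as the data are sorted by Id and no Id equals the sentinel 0.

```r
# Toy data: several radar rows per Id, one Expected value per Id (made-up values)
toy <- data.frame(Id = c(1, 1, 2, 3, 3, 3),
                  Ref = c(10, 12, NA, 7, 8, 9),
                  Expected = c(0.5, 0.5, 1.2, 3.0, 3.0, 3.0))

# Lag comparison as in the post: keep a row when its Id differs from the previous Id
first_lag <- toy$Id != c(0, toy$Id[-nrow(toy)])

# Alternative: duplicated() flags every repeat of an Id, so its negation marks the first row
first_dup <- !duplicated(toy$Id)

# One row per Id with the gauge value
keys <- toy[first_lag, c("Id", "Expected")]
print(keys)
```

Both logical vectors select one row per Id here; which of the two is faster on nine million rows is not something this sketch tells you.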

Model

To get an idea of the process I started with linear regression, but that is just a temporary approach. For linear regression there are 22 parameters: one for each of the 21 observed values plus an intercept. Prediction per row follows from a simple matrix multiplication. The model, including calculation of the error in the fit, sits in a small R function. As preparation a column of ones is added to the x data. The summary per Id can be done quickly and easily via the group_by() and summarise() functions from dplyr.
Based on the current results I have decided that such a function will have to be transferred to C++ or similar in order to get a decent computation time. But that is for a future time. It has been quite some years since I programmed in C or Fortran, so I'll need a refresher first. Luckily edX has a course 'Introduction to C++' running right now.
library(dplyr) # provides %>%, group_by() and summarise()
# drop columns 1, 2 and 24 (1 and 24 are Id and Expected), keep the rest as a matrix
r1m <- as.matrix(rawdata[,c(-1,-2,-24)])
rm(rawdata) # control memory usage
r1m <- cbind(rep(1,nrow(r1m)),r1m) # add column of 1
r1m[is.na(r1m) ] <- 0 # replace missing values by 0
betas <- rep(1,ncol(r1m))
# myerr calculates the mean prediction per Id and compares it with the Expected values
myerr <- function(betas) {
    pred <- data.frame(Id=Id,
            pred=as.numeric(tcrossprod(betas,r1m))) %>%
        group_by(.,Id) %>%
        summarise(.,m=mean(pred))
    sum(abs(pred$m-r1b$Expected))
}
# mmyerr is myerr for maximization
mmyerr <- function(betas) -myerr(betas)
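
To make clear what myerr computes, here is a minimal sketch on made-up numbers. It is a base R variant that uses rowsum() instead of dplyr, which is my substitution and not the approach used in this post; the names `toy_x`, `toy_id`, `toy_expected` and `toy_err` are hypothetical.

```r
# Made-up design matrix: intercept column plus two predictors, no missing values
toy_x <- cbind(1, c(1, 2, 3, 4, 5), c(0, 1, 0, 1, 0))
toy_id <- c(1, 1, 2, 2, 2)   # two hourly groups
toy_expected <- c(2, 5)      # one gauge value per Id, in Id order

toy_err <- function(betas) {
    pred <- as.numeric(toy_x %*% betas)           # prediction per row
    sums <- rowsum(pred, toy_id)                  # sum per Id, rows sorted by Id
    n <- rowsum(rep(1, length(pred)), toy_id)     # number of rows per Id
    sum(abs(sums / n - toy_expected))             # summed absolute error of the per-Id means
}

err <- toy_err(c(0, 1, 0))
print(err)
```

With betas of (0, 1, 0) the prediction is just the second column, so the per-Id means are 1.5 and 4 and the summed absolute error against 2 and 5 is 1.5.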

Parameter estimation & optimization

The problem has now been reduced to finding the parameters which give the lowest prediction error. Just throwing this into optim() did not lead to satisfactory results, so this post gives some experiments with alternative approaches. I played with some of these, and for this post ran them all to get decent data. The table shows a quick summary.

package	function	time	result	converged
stats	optim	771	1805459	No
stats	optim (BFGS)	729	1678736	Yes
stats	optim (CG)	5527	1678775	Yes
adagio	simpleEA	623	1722928	No
dfoptim	hjk	4289	1678734	Yes
GA	ga	589	1775910	NA
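
As an illustration of how such a comparison can be set up, the sketch below runs three optim() methods on a small made-up regression problem and collects time, result and convergence in one data frame. The data, the objective `abs_err` and the `maxit` setting are assumptions for this sketch; the numbers it produces have nothing to do with the table above. When no method is given, optim() uses Nelder-Mead.

```r
set.seed(1)
# Made-up regression data, not the Kaggle data
x <- cbind(1, rnorm(200), rnorm(200))
y <- as.numeric(x %*% c(1, 2, -1)) + rnorm(200, sd = 0.1)

# Sum of absolute errors, the same kind of criterion as myerr
abs_err <- function(betas) sum(abs(as.numeric(x %*% betas) - y))

start <- rep(1, ncol(x))
methods <- c("Nelder-Mead", "BFGS", "CG")
fits <- lapply(methods, function(m) {
    # time one optim() run; fit is assigned in this function's environment
    elapsed <- system.time(
        fit <- optim(start, abs_err, method = m, control = list(maxit = 5000))
    )["elapsed"]
    data.frame(method = m, time = unname(elapsed),
               result = fit$value, converged = fit$convergence == 0)
})
res <- do.call(rbind, fits)
print(res)
```

optim() reports convergence as code 0, so the last column is TRUE only when the method stopped by its own criterion rather than by hitting the iteration limit.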

3 comments:

Hakki13 20 October, 2015 10:39
First of all, thank you for the great work you do. This has been an awesome blog to read and learn from. I am still pretty new to R but making some progress. I really like these SAS Manual examples; I don't know SAS, but they are "real" world examples and easier to understand. Can I throw in one example that I would love to see? Example 27.14 Factor Model: Now-Casting the US Economy. I have been trying to do that but haven't made real progress; any pointers or an awesome presentation on it would be great. Keep up the excellent work. -Hakki
Hakki13 20 February, 2016 07:13
Fair enough. Hopefully you still keep your blog alive.

Sunday, October 18, 2015
