Monday, April 23, 2012

Tuning GAMBoost

This post describes some of the simulation results which I obtained with the GAMBoost package. The aim of these simulations is to get a feel what I should tune and what I should not tune with GAMBoost. 

Setup

In the GAMBoost package one can tune quite a number of parameters. I have looked at tuning bdeg (degree of the B-spline basis to be used for fitting smooth functions, values 1, 2 and 3), penalty (penalty value for the update of an individual smooth function in each boosting step, values 10, 100 and 1000) when using cv.GAMBoost to determine the optimum number of Boosting steps. As a second part, optimGAMBoostPenalty was used as a less computational burdening technique.


All data had 11 independent variables. 
The X matrix was filled with normal distributed random data. The number of objects in the training set was 25, 50, 100 or 200. The test set had 25 objects and was used to get a measure for Mean Squared Error of Prediction (RMSEP).
Three data generating models have been used, linear, quadratic and complex. 
y <- xmat[,1]*1 + rnorm(nrow)
y <- xmat[,1]*1 + xmat[,2]^2 + 5*sqrt(3.5+xmat[,3]) + rnorm(nrow)

y <- xmat[,1]*1 + xmat[,2]^2 + 5*sqrt(3.5+xmat[,3]) + 6*sin(3*xmat[,4] + 4*cos(6*xmat[,5])) + rnorm(nrow)
Each simulation was run 3 times, to get a feel of the variation in the results.

Key Learning

  • cv.GAMBoost will provide a reasonable model, independent of penalty or bdeg used. 
  • Calculation time for cross validation is quite big and is easily prohibitive.
  • optimGAMBoostPenalty gives a faster result, but also bigger RMSEP.
  • Overly complex relations are not so easily captured.
  • GAMBoost needs quite a large number of training objects to perform.

Detailed Results

  • 25 training objects was clearly not enough. 50 training objects was just not enough. 
    • When the training set contains 25 objects, the variation in RMSEP between the three repeats was quite big. This suggests that the result was dependent on the realization of the random x matrix. This effect is much less with a 50 object training set. 
  • 100 to 200 training objects was enough for the linear generated response
    • With 100 and 200 objects in the training set the linear generated response resulted in RMSEP of 1 to 1.3. This is close to expected.
  • 100 to 200 training objects was enough for quadratic generated response. But the RMSEP was higher than linear generated response.
    • With 100 to 200 objects RMSE of 1.2 to 1.6. 
    • 25 to 100 objects in training set RMSEP of 1.8 to 2.2
  • The complex model was never well predicted
    • RMSEP of 4 to 6. 
    • The RMSEP was slightly lower for 200 training objects than for 50 training objects, but not impressively better. The 200 training objects were not enough.
  • Penalty and bdeg not overly important. 
    • A lower penalty resulted in less boosting steps, but the RMSEP was not systematically influenced. cv.GAMBoost compensates with less steps for a lower penalty
    • Out of the 9 combinations penalty/bdeg, the model with best AIC did not give the best RMSEP.
  • Calculation time of the 200 objects was just feasible. 400 objects would need a lot of patience or better (newer) hardware than I am using. 
    • Parallel execution works for cv.GAMBoost. If only I had more than two cores in my computer.
  • optimGAMBoostPenalty same or higher RMSEP as cv.GAMBoost.
    • In the cases where cv.GAMBoost gave an RMSEP clearly over 1.5, optimGAMBoost performed comparable. 
    • Even in the high data situation, linear generated response and 200 training objects, cv.GAMBoost had lower RMSEP than optimGAMBoostPenalty.

R code

The code is a bit rough especially in output.
library(GAMBoost)
library(snow)
library(snowfall)
sfInit(parallel=TRUE,cpus=2)
rmse <- function(a,b) sqrt(mean((a-b)^2))

lapply(1:3, function(outer) {
ntrain <- 25
ntest <- 50
nrow <- ntrain + ntest
ncol=11

xmat <- matrix(rnorm(nrow*ncol),nrow=nrow,ncol=ncol)
colnames(xmat) <- paste('V', 1:ncol,sep='')
yvec <- xmat[,1]*1 + rnorm(nrow)+ xmat[,2]^2 + 5*sqrt(3.5+xmat[,3])+ 6*sin(3*xmat[,4] + 4*cos(6*xmat[,5]))

alldf <- as.data.frame(cbind(yvec,xmat))

xtrain <- xmat[1:ntrain,]
ytrain <- yvec[1:ntrain]
dftrain <- alldf[1:ntrain,]
xtest <- xmat[(ntrain+1):(ntrain+ntest),]
ytest <- yvec[(ntrain+1):(ntrain+ntest)]
dftest <- alldf[(ntrain+1):(ntrain+ntest),]
mysettings <- expand.grid(penalty=c(10,100,1000),maxstepno=1000,bdeg=c(1,2,3))
myres <- lapply(1:nrow(mysettings),function(x) {
   cvgm <- cv.GAMBoost(x=xtrain,y=ytrain,family=gaussian(),parallel=TRUE,
                       penalty=mysettings$penalty[x],
                       maxstepno=mysettings$maxstepno[x],
                       bdeg=mysettings$bdeg[x])
} )
mysettings$which.min.aic <- sapply(myres,function(x) {
  which.min(x$AIC)
})
mysettings$min.aic <- sapply(myres,function(x) {
  min(x$AIC)
})

mysettings$rmsep <- sapply(myres,function(x) {
  ypredgm <- predict(x,xtest)
  rmse(ypredgm,ytest)
})

  op <- optimGAMBoostPenalty(x=xtrain,y=ytrain,family=gaussian(),maxstepno=1000,parallel=TRUE,trace=FALSE)
  ypredop <- predict(op,xtest)
  
  list(mysettings=mysettings,c(optpenalty=op$penalty,oprmse=rmse(ypredop,ytest)) )
                    
} )






No comments:

Post a Comment