Regarding CPU speed, my current laptop has a lowly Celeron 877. From what I see of my computer's activity, under R it is mostly one core that does the work. This means that even though there are two cores, the single-core CPU mark of 715 (from cpubenchmark.net) is what I have available. A bit of checking shows that the current batch of processors mainly has more cores. For instance, the highest rated common CPU, an Intel Core i7-4710HQ, has a CPU mark of 7935 and a single-core mark of 1870. That is 2.5 times faster for one core, but its top rating comes from having four cores. The same holds down the line: four cores is common, yet single-core speed has not improved that much. Unless I can actually use those extra cores, what is the gain? Hence I am wondering: can I do something with extra cores for real-world R computations? That is something I can investigate.
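As a first check, the parallel package that ships with R can report how many cores the machine exposes; a minimal sketch (the reported values are of course machine dependent):

library(parallel)
# logical cores the operating system reports; on this Celeron 877 that is 2
detectCores()
# number of workers mclapply() uses by default unless told otherwise
getOption("mc.cores", 2L)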
Easy approach: parallel
A bit of browsing shows that the parallel package is the easy way to use multiple cores; think of using mclapply() rather than lapply(). In many situations this is straightforward, for instance cross validation, where the only upfront cost is partitioning the data into chunks (a sketch follows below). Trying different settings for a machine learning problem is similar. To give this a real-world setting, data was taken from the UCI machine learning repository: the Physicochemical Properties of Protein Tertiary Structure Data Set, which has 45730 rows and 9 variables. A bit of plotting shows this figure for 2000 randomly selected rows. It seems the problem is not so much which variables to use but rather their interactions. This was also suggested by the poor performance of linear regression.
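A minimal sketch of that cross-validation pattern, using a toy data frame dat with response y as a placeholder (not the protein data):

library(parallel)

# toy stand-in data; in the post the protein data set would take this role
dat <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))

# split the (shuffled) row numbers into k folds of roughly equal size
k <- 4
folds <- split(sample(seq_len(nrow(dat))), rep_len(seq_len(k), nrow(dat)))

# one model per fold, fitted in parallel; each worker leaves its fold out
# mclapply() relies on forking, so on Windows use parLapply() instead
cv_rmse <- mclapply(folds, function(test_rows) {
  fit <- lm(y ~ ., data = dat[-test_rows, ])
  pred <- predict(fit, newdata = dat[test_rows, ])
  sqrt(mean((dat$y[test_rows] - pred)^2))
}, mc.cores = 2)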
Random forest in parallel
Even though nine variables is a bit low for random forest, I elected to use it as the first technique. The main parameters to tune are nodesize and the number of variables to try at each split (mtry). Hence I wrapped the fits in mclapply(), not even using cross validation and taking care not to nest the mclapply() calls. The result was a big increase in memory usage, which in hindsight may be obvious: each of the parallel instances gets a complete copy of the data set. The net effect was that I ran out of RAM and data was swapped, which cannot be good for performance. It may also explain comments I have read that the caret package uses too much memory. A decent set of hardware for machine learning, including a four-core processor, would create four instances of the same data. Perhaps adding another 4 GB of memory and an SSD rather than a HDD would serve me just as well as a new laptop...

library(parallel)
library(randomForest)

# grid of tuning settings: mtry (variables tried per split) and nodesize
tol <- expand.grid(mtry = 1:3,
                   nodesize = c(3, 5, 10))

# one random forest per row of the settings grid, fitted in parallel
bomen <- mclapply(seq_len(nrow(tol)), function(i)
  randomForest(
    y = train[, 1],
    x = train[, -1],
    ntree = 50,
    mtry = tol$mtry[i],
    nodesize = tol$nodesize[i])
)
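If memory rather than CPU is the constraint, one option (a hedged suggestion, not something tried above) is to cap the number of simultaneous forks with the mc.cores argument of mclapply(); fewer concurrent workers means fewer simultaneous copies of the data, at the cost of less parallelism:

# cap the forked workers at 2 (also the default, getOption("mc.cores", 2L))
bomen <- mclapply(seq_len(nrow(tol)), function(i)
  randomForest(y = train[, 1], x = train[, -1], ntree = 50,
               mtry = tol$mtry[i], nodesize = tol$nodesize[i]),
  mc.cores = 2)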
"Perhaps adding another 4 GB of memory and an SSD rather than a HDD would serve me just as well as a new laptop"
ReplyDeleteNope. In your situation it is past time you upgrade. Even a single-threaded process will see a 2-10x speed up going from an old celeron to a broadwell/skylake. The only questions are whether to go with 2 or 4 cores, 8 or 16GB of RAM and size of SSD (though that will at least be upgradable in the future). If you are doing serious computation there is no reason to short on hardware, it is far cheaper than your time.
First of all, thank you for your comment. I agree that my time should be more precious than my computer. However, I am not using this laptop professionally; at work there is quite a different setup. For professional use I would probably follow M. Edward Borasky and get a custom-made workstation.
So, in this non-professional usage there are relatively few occasions where my current laptop is not sufficient, mostly when looking at data mining and big data. In addition, it seems that next year might bring improvements in processors from both Intel and AMD; buying this year means not buying next year. Hence the idea that I can work with this one a bit longer.
Look into Teraproc. They make a cloud instance of R with RStudio that is already highly tuned for multi-core and GPU computing. It runs on Amazon, and they auto-configure everything, even a small cluster if you want one, including making all the connections. Buying a super gaming rig like the one mentioned above costs about $2600, but renting one costs about 50 cents an hour.