Quick Example of Parallel Computation in R for SVM/Random Forest, with MNIST and Credit Data

It is generally acknowledged that the SVM algorithm is relatively slow to train, especially once tuning parameters such as cost and kernel choice are involved.

The usual way to boost speed is to apply packages such as "parallel", "doParallel", and "doSNOW", together with the foreach function.
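As a minimal sketch of that pattern (a toy computation, not from the post): register a backend, then let foreach's `%dopar%` farm iterations out to the workers.

```r
library(doParallel)

cl <- makeCluster(2)        # a small, fixed number of workers for the demo
registerDoParallel(cl)

# %dopar% sends each iteration to a worker; .combine collects the results
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}

stopCluster(cl)
squares
# [1]  1  4  9 16
```

Everything inside the braces runs on a worker process, so any packages or variables the body needs must be shipped there (foreach auto-exports referenced variables; packages go in the `.packages` argument).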

Data and background: https://charleshsliao.wordpress.com/2017/02/24/svm-tuning-based-on-mnist/


##########################################################
#1. set up the data-loading functions
load_image_file <- function(filename) {
  ret = list()
  f = file(filename, 'rb')
  readBin(f, 'integer', n = 1, size = 4, endian = 'big')          # skip magic number
  ret$n = readBin(f, 'integer', n = 1, size = 4, endian = 'big')  # image count
  nrow = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  ncol = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  x = readBin(f, 'integer', n = ret$n * nrow * ncol, size = 1, signed = F)
  ret$x = matrix(x, ncol = nrow * ncol, byrow = T)                # one image per row
  close(f)
  ret
}

load_label_file <- function(filename) {
  f = file(filename, 'rb')
  readBin(f, 'integer', n = 1, size = 4, endian = 'big')  # skip magic number
  n = readBin(f, 'integer', n = 1, size = 4, endian = 'big')  # label count
  y = readBin(f, 'integer', n = n, size = 1, signed = F)
  close(f)
  y
}
##########################################################
#2. data-loading time comparison with/without doParallel
pt<- proc.time()
imagetraining<-as.data.frame(load_image_file("train-images-idx3-ubyte"))
proc.time()-pt

##    user  system elapsed
##   2.952   0.282   3.260

imagetest<-as.data.frame(load_image_file("t10k-images-idx3-ubyte"))
labeltraining<-as.factor(load_label_file("train-labels-idx1-ubyte"))
labeltest<-as.factor(load_label_file("t10k-labels-idx1-ubyte"))
imagetraining[,1]<-labeltraining

library(doParallel)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

cl <- makeCluster(detectCores())
registerDoParallel(cl)
pt<- proc.time()
imagetraining_p<-as.data.frame(load_image_file("train-images-idx3-ubyte"))
proc.time()-pt

##    user  system elapsed
##   2.976   0.301   3.328

stopCluster(cl)

imagetest[,1]<-labeltest
Training<-imagetraining
Test<-imagetest

##########################################################
# 3. Train and predict for SVM within doParallel
library(e1071)  # Support Vector Machine (SVM)
samplenumber<-20000 # change sample size here
vec<-seq(from=1,to=60000,by=1)
mysample<-sample(vec,samplenumber)
mysampleTraining<-Training[mysample,]

pt <- proc.time()
# column "n" of the training frame now holds the digit labels
svmmodel <- svm(n ~ ., data = mysampleTraining,
                method = "class", kernel = "linear", scale = FALSE, cost = 10)
proc.time() - pt

##    user  system elapsed
## 197.287   1.850 202.080

cl <- makeCluster(detectCores())
registerDoParallel(cl)
pt<- proc.time()
svmmodel_p <- svm(n ~ ., data = mysampleTraining,
                  method = "class", kernel = "linear", scale = FALSE, cost = 10)
proc.time()-pt

##    user  system elapsed
## 204.249   2.862 210.902

stopCluster(cl)

pt <- proc.time()
svmp <- predict(svmmodel, newdata = Test, type = "class")
proc.time() - pt

##    user  system elapsed
##  36.656   0.434  37.364

cl <- makeCluster(detectCores())
registerDoParallel(cl)
pt <- proc.time()
svmp <- predict(svmmodel, newdata = Test, type = "class")
proc.time() - pt

##    user  system elapsed
##  36.255   0.360  36.847

stopCluster(cl)

##############################################
# 4. parallel computation with doSNOW and foreach
library(doSNOW)

## Loading required package: snow

##
## Attaching package: 'snow'

## The following objects are masked from 'package:parallel':
##
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, clusterSplit, makeCluster,
##     parApply, parCapply, parLapply, parRapply, parSapply,
##     splitIndices, stopCluster

cl <- makeCluster(detectCores())
registerDoSNOW(cl)
pt<- proc.time()
svmmodel_dnfe_p <- foreach(i = 1:detectCores()) %dopar% {
  library(e1071)
  # note: every worker repeats the *full* prediction here,
  # so the work is duplicated rather than divided
  predict(svmmodel, newdata = Test, type = "class")
}
proc.time()-pt

##    user  system elapsed
##   0.442   0.104  61.053

stopCluster(cl)
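The foreach loop above runs the identical full prediction on every worker, so the elapsed time cannot improve. To genuinely divide the work, each worker should score only its own chunk of rows, and the chunks should be recombined afterwards. A sketch of that pattern with the base parallel package, where a toy doubling function stands in for `predict(svmmodel, ...)`:

```r
library(parallel)

Test_vec <- 1:10  # stand-in for the rows of Test

cl <- makeCluster(2)
# one chunk of row indices per worker
chunks <- splitIndices(length(Test_vec), length(cl))
# each worker "predicts" (here: doubles) only its own chunk;
# the data is passed as an argument so it reaches the workers
parts <- parLapply(cl, chunks, function(idx, v) v[idx] * 2, v = Test_vec)
stopCluster(cl)

out <- unlist(parts)  # recombine in the original row order
```

With a real model, the model object would also need to reach the workers (e.g. via `clusterExport`), along with `library(e1071)` on each worker.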

Parallel computing is not guaranteed to make training or prediction faster.

It pays off most in embarrassingly parallel workloads: heavy numeric computation, resampling, and ensemble scenarios such as bootstrapping, cross-validation, neural networks, and random forests. A single svm() fit, by contrast, is one sequential optimisation, which is why the results above came out no faster than the runs without parallel computing.
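A bootstrap, for instance, is embarrassingly parallel: every replicate is independent, so foreach can hand replicates out to workers and simply concatenate the results. A toy sketch (not from the post; for reproducible parallel random numbers you would add the doRNG package):

```r
library(doParallel)

x <- rnorm(1000)  # toy data

cl <- makeCluster(2)
registerDoParallel(cl)
# 200 independent bootstrap replicates of the mean, split across workers
boot_means <- foreach(b = 1:200, .combine = c) %dopar% {
  mean(sample(x, replace = TRUE))
}
stopCluster(cl)

length(boot_means)  # 200
```

Because no replicate depends on any other, the speedup scales with the number of workers, minus the (here non-trivial) overhead of shipping `x` to each of them.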

To test this speculation, I revisited the random forest model trained on the German Credit Data:

https://charleshsliao.wordpress.com/2017/03/14/credit-analysis-with-roc-evaluation-in-neural-network-and-random-forest/

In step 4 of that post, I added parallel computation in the two ways shown above:


library(randomForest)
library(caret)

# baseline: repeated cross-validation, single process
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
grid_rf <- expand.grid(.mtry = c(2, 4, 8, 16))
set.seed(666)
pt <- proc.time()
gcc_rfcv <- train(Creditability ~ ., data = gcc_train, method = "rf",
                  trControl = ctrl, tuneGrid = grid_rf)
proc.time() - pt
# user    system  elapsed
# 176.236 4.723   182.843

library(doSNOW)
library(doParallel)

# way 1: wrap the whole train() call in foreach -- every worker repeats it
cl <- makeCluster(detectCores())
registerDoSNOW(cl)
pt <- proc.time()
gcc_rfcv_dnef <- foreach(i = 1:detectCores()) %dopar% {
  caret::train(Creditability ~ ., data = gcc_train, method = "rf",
               trControl = ctrl, tuneGrid = grid_rf)
}
proc.time() - pt
# user  system elapsed
# 0.540 0.626  291.556
stopCluster(cl)

# way 2: register the backend and let caret parallelise its resampling loop
cl <- makeCluster(detectCores())
registerDoParallel(cl)
pt <- proc.time()
gcc_rfcv_dp <- train(Creditability ~ ., data = gcc_train, method = "rf",
                     trControl = ctrl, tuneGrid = grid_rf)
proc.time() - pt
stopCluster(cl)
# user  system elapsed
# 2.049 0.263  89.511

As shown above, in the ensemble scenario of random forest, running the training with a registered doParallel backend is significantly more efficient: caret's train() uses foreach internally, so it distributes its resampling iterations (here, 10-fold cross-validation repeated 5 times) across the registered workers:

#user system elapsed
#2.049 0.263 89.511

The first method, which manually wrapped the call in foreach with doSNOW so that every core repeated the entire training, gave a disappointing result:

#user system elapsed
#0.540 0.626 291.556

To conclude, I need a new Mac Pro.
