Ensemble with Random Forest, Boosting, and Caret Package

Ensemble methods improve on the performance of individual models by combining many of them, most commonly through bagging, boosting, and random forests.

We use credit data from: https://charleshsliao.wordpress.com/2017/03/04/a-quick-classification-example-with-c5-0-in-r/

The caret package is powerful here: it enables us to consolidate cross-validation and the training process in a single workflow, with the flexibility to tune the model's parameters explicitly.
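Bagging is mentioned above but not demonstrated below, so here is a minimal, self-contained sketch of bagged decision trees using ipred::bagging() (the ipred package is my own assumption and is not used in the rest of this post):

library(ipred)
set.seed(2017)
credit_bag<-read.csv("credit.csv")
credit_bag$default<-as.factor(credit_bag$default)
#fit 25 bootstrap-aggregated decision trees
bag_model<-bagging(default~.,data=credit_bag,nbagg=25)
#in-sample predictions, just to show the interface
bag_pred<-predict(bag_model,credit_bag)
table(bag_pred,credit_bag$default)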


library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

set.seed(2017)
credit<-read.csv("credit.csv")
#be sure you are doing classifying
credit$default<-as.factor(credit$default)
class(credit$default)

## [1] "factor"

rfm<-randomForest(default~.,data=credit)
rfm

##
## Call:
##  randomForest(formula = default ~ ., data = credit)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
##
##         OOB estimate of  error rate: 23.6%
## Confusion matrix:
##     1   2 class.error
## 1 643  57  0.08142857
## 2 179 121  0.59666667
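The OOB confusion matrix shows the forest does much better on class 1 than on class 2. A quick way to inspect which predictors the forest leans on (not shown in the original post, but these helpers come with randomForest) is:

#mean decrease in Gini for each predictor
importance(rfm)
#plot the same importances
varImpPlot(rfm)
#OOB predictions for the training rows are stored on the fit
head(rfm$predicted)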

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

##
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
##
##     margin

#build validation control
ctrl<-trainControl(method="repeatedcv",number=10,repeats = 3)
#set up the tuning grid for the random forest. The only tuning
#parameter for this model is mtry, which defines how many features
#are randomly selected as split candidates at each node. By default,
#the random forest uses floor(sqrt(20)), or four features per split,
#which matches the fit above
grid_rf<-expand.grid(.mtry=c(2,4,8,16))
set.seed(300)
m_rfm<-train(default~.,data=credit,method="rf",metric="Kappa",
trControl=ctrl,tuneGrid=grid_rf)
m_rfm

## Random Forest
##
## 1000 samples
##   20 predictor
##    2 classes: '1', '2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##    2    0.7216667  0.1068365
##    4    0.7536667  0.2964340
##    8    0.7590000  0.3456676
##   16    0.7580000  0.3603359
##
## Kappa was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 16.
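The train object keeps the winning tuning parameters and a final model refit on the full data. A short sketch of how they could be used (standard caret usage, not from the original post):

#mtry chosen by the largest Kappa
m_rfm$bestTune
#the underlying randomForest fit
m_rfm$finalModel
#predict() on the train object applies the final tuned model
p_rf<-predict(m_rfm,newdata=credit)
head(p_rf)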

#compare with boosting
#winnow must be passed as a logical, not the string "FALSE"
grid_c50<-expand.grid(.model="tree",.trials=c(10,20,30,40),
.winnow=FALSE)
set.seed(400)
m_c50m<-train(default~.,data=credit, method="C5.0",metric="Kappa",
trControl=ctrl,tuneGrid=grid_c50)

## Loading required package: C50

## Loading required package: plyr

m_c50m

## C5.0
##
## 1000 samples
##   20 predictor
##    2 classes: '1', '2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
## Resampling results across tuning parameters:
##
##   trials  Accuracy   Kappa
##   10      0.7376667  0.3238532
##   20      0.7413333  0.3356165
##   30      0.7476667  0.3500891
##   40      0.7433333  0.3403729
##
## Tuning parameter 'model' was held constant at a value of tree
##
## Tuning parameter 'winnow' was held constant at a value of FALSE
## Kappa was used to select the optimal model using  the largest value.
## The final values used for the model were trials = 30, model = tree
##  and winnow = FALSE.

For comparison, the accuracy of the original C5.0 model without boosting was 0.68, so boosting with 30 trials lifts it to roughly 0.75 here.
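Since both models were tuned with the same repeated cross-validation control, their resampling results can also be collected side by side with caret's resamples() (a rough sketch, not in the original post; for a strictly paired comparison the same resampling seed should be used before both train() calls):

results<-resamples(list(rf=m_rfm,c50=m_c50m))
#accuracy and Kappa distributions across the 30 resamples
summary(results)
#lattice box-and-whisker comparison of the two models
bwplot(results)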
