Credit Analysis with ROC evaluation in Neural Network and Random Forest

This is quite like the article using C5.0 to conduct classification: https://charleshsliao.wordpress.com/2017/03/04/a-quick-classification-example-with-c5-0-in-r/

We tried to use more mature and powerful algorithms with cross validation and parameters tuning.

1. At first we preprocess the data.

</pre>
##################################################
#1. load, clean and preprocess data
url <- 'https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/german_credit.csv'
german_credit <- read.csv(url, header = TRUE, sep = ',')
# as the results shown from control part of Caret package used in NN adn RF,
# it is better to change the value of this variable for smooth processing
german_credit$No.of.Credits.at.this.Bank[german_credit$No.of.Credits.at.this.Bank == 4] <- 3
str(german_credit)

## 'data.frame':    1000 obs. of  21 variables:
##  $ Creditability                    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Account.Balance                  : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Duration.of.Credit..month.       : int  18 9 12 12 12 10 8 6 18 24 ...
##  $ Payment.Status.of.Previous.Credit: int  4 4 2 4 4 4 4 4 4 2 ...
##  $ Purpose                          : int  2 0 9 0 0 0 0 0 3 3 ...
##  $ Credit.Amount                    : int  1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
##  $ Value.Savings.Stocks             : int  1 1 2 1 1 1 1 1 1 3 ...
##  $ Length.of.current.employment     : int  2 3 4 3 3 2 4 2 1 1 ...
##  $ Instalment.per.cent              : int  4 2 2 3 4 1 1 2 4 1 ...
##  $ Sex...Marital.Status             : int  2 3 2 3 3 3 3 3 2 2 ...
##  $ Guarantors                       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Duration.in.Current.address      : int  4 2 4 2 4 3 4 4 4 4 ...
##  $ Most.valuable.available.asset    : int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Age..years.                      : int  21 36 23 39 38 48 39 40 65 23 ...
##  $ Concurrent.Credits               : int  3 3 3 3 1 3 3 3 3 3 ...
##  $ Type.of.apartment                : int  1 1 1 1 2 1 2 2 2 1 ...
##  $ No.of.Credits.at.this.Bank       : num  1 2 1 2 2 2 2 1 2 1 ...
##  $ Occupation                       : int  3 3 2 2 2 2 2 2 1 1 ...
##  $ No.of.dependents                 : int  1 2 1 2 1 2 1 2 1 1 ...
##  $ Telephone                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Foreign.Worker                   : int  1 1 1 2 2 2 2 2 1 1 ...

to_exclude<-c(3,6,14)
german_credit_clean<-german_credit[,-to_exclude]
str(german_credit_clean)

## 'data.frame':    1000 obs. of  18 variables:
##  $ Creditability                    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Account.Balance                  : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Payment.Status.of.Previous.Credit: int  4 4 2 4 4 4 4 4 4 2 ...
##  $ Purpose                          : int  2 0 9 0 0 0 0 0 3 3 ...
##  $ Value.Savings.Stocks             : int  1 1 2 1 1 1 1 1 1 3 ...
##  $ Length.of.current.employment     : int  2 3 4 3 3 2 4 2 1 1 ...
##  $ Instalment.per.cent              : int  4 2 2 3 4 1 1 2 4 1 ...
##  $ Sex...Marital.Status             : int  2 3 2 3 3 3 3 3 2 2 ...
##  $ Guarantors                       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Duration.in.Current.address      : int  4 2 4 2 4 3 4 4 4 4 ...
##  $ Most.valuable.available.asset    : int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Concurrent.Credits               : int  3 3 3 3 1 3 3 3 3 3 ...
##  $ Type.of.apartment                : int  1 1 1 1 2 1 2 2 2 1 ...
##  $ No.of.Credits.at.this.Bank       : num  1 2 1 2 2 2 2 1 2 1 ...
##  $ Occupation                       : int  3 3 2 2 2 2 2 2 1 1 ...
##  $ No.of.dependents                 : int  1 2 1 2 1 2 1 2 1 1 ...
##  $ Telephone                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Foreign.Worker                   : int  1 1 1 2 2 2 2 2 1 1 ...
<pre>for(i in 1:18) {
   german_credit_clean[, i] <- as.factor(german_credit_clean[, i])}
 order<-sample(1000,650)
 gcc_train<-german_credit_clean[order,]
 gcc_test<-german_credit_clean[-order,]
 gcc_test_result<-as.factor(as.numeric(gcc_test[,1])-1)
 ##################################################

2. We can start with the basic logistic regression model.

</pre>
##################################################
#2. Build Logstic regression model
set.seed(2017)
gcc_lg<-glm(Creditability ~ ., family = binomial, data = gcc_train)
gcc_lg_predict<-predict(gcc_lg,type="response",newdata=gcc_test)
gcc_lg_predict_result<-rep("1",1000-650)
gcc_lg_predict_result[gcc_lg_predict<0.5]="0"
# since the model of logstic regression generates pro not class
table(gcc_lg_predict_result,gcc_test_result)

##                      gcc_test_result
## gcc_lg_predict_result   0   1
##                     0  46  47
##                     1  55 202

error_rate_lg<-sum(gcc_lg_predict_result!=gcc_test_result)/nrow(as.data.frame(gcc_test_result))
# build ROC
library(ROCR)

## Loading required package: gplots

##
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
##
##     lowess

pred_lg <- ROCR::prediction(gcc_lg_predict, gcc_test$Creditability)
perf_lg <- performance(pred_lg, 'tpr', 'fpr')
plot(perf_lg)

Auc_lg<- performance(pred_lg, measure = 'auc')@y.values[[1]]
print(paste0("The Area Under the Curve is ",Auc_lg))

## [1] "The Area Under the Curve is 0.729531989343513"

print(paste0("The Accuracy(Precision) is ",1-error_rate_lg))

## [1] "The Accuracy(Precision) is 0.708571428571429"
<pre>##################################################

The ROC chart below shows the Average Under Curve value as a metric evaluating the  model. Screen Shot 2017-03-14 at 12.01.05 PM.png 3. We also build the CART model since it behaves well generally.

</pre>
##################################################
#3. Build r-part (CART) model
set.seed(2017)
library(rpart)
library(rpart.plot)
gcc_rpart<-rpart(Creditability ~ .,  method="class",data = gcc_train)
rpart.plot(gcc_rpart,type=4,under=T,varlen=-1,tweak=1.1,shadow.col="grey")

prp(gcc_rpart,type=2,tweak=0.8)

gcc_rpart_predict<-predict(gcc_rpart,newdata=gcc_test,type="class")
table("Predict Value"=gcc_rpart_predict,"Actual Value"=gcc_test_result)

##              Actual Value
## Predict Value   0   1
##             0  40  38
##             1  61 211

error_rate_rpart<-sum(gcc_rpart_predict!=gcc_test_result)/nrow(as.data.frame(gcc_test_result))
# build ROC
fit_rpart<-predict(gcc_rpart,newdata=gcc_test,type="prob")[,2]
pred_rpart <- ROCR::prediction(fit_rpart, gcc_test$Creditability)
perf_rpart <- performance(pred_rpart, 'tpr', 'fpr')
plot(perf_rpart)
Auc_rpart<- performance(pred_rpart, measure = 'auc')@y.values[[1]]
print(paste0("The Area Under the Curve is ",Auc_rpart))

## [1] "The Area Under the Curve is 0.726172014791841"

print(paste0("The Accuracy(Precision) is ",1-error_rate_rpart))

## [1] "The Accuracy(Precision) is 0.717142857142857"
<pre>##################################################

The two styles of plots of Rpart and the ROC curve.

4. We set up Random Forest model with CV based on Caret package. We need to convert “0” and “1” to more factor-like value such as “X1” and “X0” to conduct train function of caret package.

</pre>
##################################################
#4. Build Random Forest
library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

gcc_train_rf<-gcc_train
levels <- unique(gcc_train_rf[,1])
gcc_train_rf[,1] <- factor(gcc_train_rf[,1], labels=make.names(levels))
gcc_rf<-randomForest(Creditability ~ .,  method="class",data = gcc_train_rf)
gcc_rf

##
## Call:
##  randomForest(formula = Creditability ~ ., data = gcc_train_rf,      method = "class")
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
##
##         OOB estimate of  error rate: 24.77%
## Confusion matrix:
##    X1  X0 class.error
## X1 79 120  0.60301508
## X0 41 410  0.09090909

# the error rate would be 26.77% without CV
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

##
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
##
##     margin

#build validation control
ctrl<-trainControl(method="repeatedcv",number=10,repeats = 5)
#set up the tuning grid for the random forest. The only tuning
#parameter for this model is mtry, which defines how many features
#are randomly selected at each split. By default, we know that the
#random forest will use sqrt(16), or four features per tree
grid_rf<-expand.grid(.mtry=c(2,4,8,16))
set.seed(666)
gcc_rfcv<-train(Creditability ~ .,data = gcc_train,method="rf",
trControl=ctrl,tuneGrid=grid_rf)
gcc_rfcv_predict_results<-predict(gcc_rfcv,newdata=gcc_test,type="raw")
gcc_rfcv_predict_results<-factor(as.numeric(gcc_rfcv_predict_results)-1)
table("Predict Value"=gcc_rfcv_predict_results,"Actual Value"=gcc_test_result)

##              Actual Value
## Predict Value   0   1
##             0  31  35
##             1  70 214

error_rate_rfcv<-sum(gcc_rfcv_predict_results!=gcc_test_result)/nrow(as.data.frame(gcc_test_result))
# build ROC
fit_rfcv<-predict(gcc_rfcv,newdata=gcc_test,type="prob")[,2]
pred_rfcv <- ROCR::prediction(fit_rfcv, gcc_test$Creditability)
perf_rfcv <- performance(pred_rfcv, 'tpr', 'fpr')
plot(perf_rfcv)

Auc_rfcv<- performance(pred_rfcv, measure = 'auc')@y.values[[1]]
print(paste0("The Area Under the Curve is ",Auc_rfcv))

## [1] "The Area Under the Curve is 0.711380174161995"

print(paste0("The Accuracy(Precision) is ",1-error_rate_rfcv))

## [1] "The Accuracy(Precision) is 0.7"

The ROC curve and Accuracy rate
Screen Shot 2017-03-14 at 12.04.35 PM

5. The neural network.

##################################################
#5. Build Neural Network with nnet and caret packages
library(nnet)
library(pROC)

## Type 'citation("pROC")' for a citation.

##
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
##
##     cov, smooth, var

gcc_train_nn<-gcc_train
levels <- unique(gcc_train_nn[,1])
gcc_train_nn[,1] <- factor(gcc_train_nn[,1], labels=make.names(levels))
numFolds <- trainControl(method = 'cv', number = 10, repeats=3, classProbs = TRUE,
verboseIter = TRUE, summaryFunction = twoClassSummary)
gcc_nn<- train(x=gcc_train_nn[,-1],y=gcc_train_nn[,1], method = "nnet"
, trControl = numFolds, maxit=300,
tuneGrid=expand.grid(size=c(15), decay=c(0.1)))

## Warning in train.default(x = gcc_train_nn[, -1], y = gcc_train_nn[, 1], :
## The metric "Accuracy" was not in the result set. ROC will be used instead.

## + Fold01: size=15, decay=0.1
## # weights:  781
## initial  value 502.649011
## iter  10 value 263.503075
## skipped
## stopped after 300 iterations

gcc_nn_predict_results<-predict(gcc_nn,newdata=gcc_test,type = "raw")
gcc_nn_predict_results<-factor(as.numeric(gcc_nn_predict_results)-1)
table("Predict Value"=gcc_nn_predict_results,"Actual Value"=gcc_test_result)

##              Actual Value
## Predict Value   0   1
##             0  53  49
##             1  48 200

error_rate_nn<-sum(gcc_nn_predict_results!=gcc_test_result)/nrow(as.data.frame(gcc_test_result))
# build ROC
fit_nn<-predict(gcc_nn,newdata=gcc_test,type="prob")[,2]
pred_nn <- ROCR::prediction(fit_nn, gcc_test$Creditability)
perf_nn <- performance(pred_nn, 'tpr', 'fpr')
plot(perf_nn)

Auc_nn<- performance(pred_nn, measure = 'auc')@y.values[[1]]
print(paste0("The Area Under the Curve is ",Auc_nn))

## [1] "The Area Under the Curve is 0.718915265020478"

print(paste0("The Accuracy(Precision) is ",1-error_rate_nn))

## [1] "The Accuracy(Precision) is 0.722857142857143"
<pre>

Not that significant.

Screen Shot 2017-03-14 at 12.04.44 PM.png

Advertisements

1 thought on “Credit Analysis with ROC evaluation in Neural Network and Random Forest”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s