CART, A Regression Tree Model for Wine Choosing

Compared with traditional regression, decision trees may be better suited to tasks with many features or with complex, non-linear relationships between the features and the outcome. These situations are challenging for regression. Regression modeling also makes assumptions about how numeric data are distributed, and real-world data often violate them. Trees make no such assumptions.

Trees for numeric prediction fall into two categories.

The first type, known as regression trees, was introduced in the 1980s as part of the seminal Classification and Regression Tree (CART) algorithm. Despite the name, regression trees do not use linear regression methods; instead, they make predictions based on the average value of the examples that reach a leaf.
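As a minimal sketch of that idea, the base R snippet below uses a made-up toy data set (the values and the cut point of 11 are illustrative assumptions, not taken from the wine data) to show that a one-split regression tree simply predicts the mean outcome on each side of the split.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.0, 11.5, 12.0, 12.5),
  quality = c(5, 5, 6, 6, 7, 8)
)

# One split on alcohol at an arbitrary, hand-picked cut point
cut_point <- 11
left  <- toy$quality[toy$alcohol <  cut_point]
right <- toy$quality[toy$alcohol >= cut_point]

# Each leaf predicts the mean quality of the examples that reach it
leaf_means <- c(left = mean(left), right = mean(right))

# Predict for new wines by routing each one to a leaf
predict_stump <- function(alcohol) {
  ifelse(alcohol < cut_point, leaf_means[["left"]], leaf_means[["right"]])
}
stump_pred <- predict_stump(c(9.8, 12.2))
stump_pred
```

A real CART implementation also searches over features and cut points to find the split that most reduces the variation in the outcome; the sketch skips that search and only shows what a leaf predicts.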

The second type is known as model trees. Introduced several years after regression trees, they are less well known but perhaps more powerful. Model trees are grown in much the same way as regression trees, but at each leaf a multiple linear regression model is built from the examples reaching that node. Depending on the number of leaf nodes, a model tree may build tens or even hundreds of such models. This can make a model tree harder to interpret than the equivalent regression tree, with the benefit that it may be more accurate.
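To make the difference concrete, here is a small base R sketch on invented data in which the outcome follows a different straight line on each side of a split. The split at `x <= 5` is assumed to be given, and the per-leaf `lm()` fits only imitate what a model tree does at its leaves; this is not how M5P or Cubist are implemented internally.

```r
# Toy data (hypothetical): y follows a different line on each side of x = 5
toy <- data.frame(x = 1:10)
toy$y <- ifelse(toy$x <= 5, 2 * toy$x, 20 - toy$x)

# Pretend the tree has already chosen the split x <= 5
in_left <- toy$x <= 5

# Regression-tree leaves: one constant (the leaf mean) per leaf
tree_pred <- ifelse(in_left, mean(toy$y[in_left]), mean(toy$y[!in_left]))

# Model-tree leaves: one linear model per leaf
fit_left  <- lm(y ~ x, data = toy[in_left, ])
fit_right <- lm(y ~ x, data = toy[!in_left, ])
model_pred <- numeric(nrow(toy))
model_pred[in_left]  <- predict(fit_left,  newdata = toy[in_left, ])
model_pred[!in_left] <- predict(fit_right, newdata = toy[!in_left, ])

# Mean absolute error of each approach on the toy data
errs <- c(regression_tree = mean(abs(toy$y - tree_pred)),
          model_tree      = mean(abs(toy$y - model_pred)))
errs
```

Because the toy outcome is exactly linear within each leaf, the per-leaf models fit it almost perfectly while the leaf means cannot; real data will rarely be this clean.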

We use the white wine data from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml).


# Load the white wine data and inspect its structure
ww <- read.csv("whitewines.csv")
str(ww)

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

# Quick check of the distribution of quality ratings with a histogram
hist(ww$quality)

[Figure: histogram of ww$quality]


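# Use the first 3750 rows for training and the remaining 1148 for testing
wwtrain <- ww[1:3750, ]
wwtest <- ww[3751:4898, ]
library(rpart)
library(rpart.plot)
# Fit a regression tree that predicts quality from all other features
wwrpart <- rpart(quality ~ ., data = wwtrain)
summary(wwrpart)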

## Call:
## rpart(formula = quality ~ ., data = wwtrain)
##   n= 3750
##
##           CP nsplit rel error    xerror       xstd
## 1 0.17816211      0 1.0000000 1.0007383 0.02388934
## 2 0.04439109      1 0.8218379 0.8228040 0.02237432
## 3 0.02890893      2 0.7774468 0.7845750 0.02201169
## 4 0.01655575      3 0.7485379 0.7578866 0.02087785
## 5 0.01108600      4 0.7319821 0.7458478 0.02043429
## 6 0.01000000      5 0.7208961 0.7462251 0.02053336
##
## Variable importance
##              alcohol              density            chlorides
##                   38                   23                   12
##     volatile.acidity total.sulfur.dioxide  free.sulfur.dioxide
##                   12                    7                    6
##            sulphates                   pH       residual.sugar
##                    1                    1                    1
##
## Node number 1: 3750 observations,    complexity param=0.1781621
##   mean=5.886933, MSE=0.8373493
##   left son=2 (2473 obs) right son=3 (1277 obs)
##   Primary splits:
##       alcohol              < 10.85    to the left,  improve=0.17816210, (0 missing)
##       density              < 0.992385 to the right, improve=0.11980970, (0 missing)
##       chlorides            < 0.0395   to the right, improve=0.08199995, (0 missing)
##       total.sulfur.dioxide < 153.5    to the right, improve=0.03875440, (0 missing)
##       free.sulfur.dioxide  < 11.75    to the left,  improve=0.03632119, (0 missing)
##   Surrogate splits:
##       density              < 0.99201  to the right, agree=0.869, adj=0.614, (0 split)
##       chlorides            < 0.0375   to the right, agree=0.773, adj=0.334, (0 split)
##       total.sulfur.dioxide < 102.5    to the right, agree=0.705, adj=0.132, (0 split)
##       sulphates            < 0.345    to the right, agree=0.670, adj=0.031, (0 split)
##       fixed.acidity        < 5.25     to the right, agree=0.662, adj=0.009, (0 split)
##
## Node number 2: 2473 observations,    complexity param=0.04439109
##   mean=5.609381, MSE=0.6108623
##   left son=4 (1406 obs) right son=5 (1067 obs)
##   Primary splits:
##       volatile.acidity    < 0.2425   to the right, improve=0.09227123, (0 missing)
##       free.sulfur.dioxide < 13.5     to the left,  improve=0.04177240, (0 missing)
##       alcohol             < 10.15    to the left,  improve=0.03313802, (0 missing)
##       citric.acid         < 0.205    to the left,  improve=0.02721200, (0 missing)
##       pH                  < 3.325    to the left,  improve=0.01860335, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 111.5    to the right, agree=0.610, adj=0.097, (0 split)
##       pH                   < 3.295    to the left,  agree=0.598, adj=0.067, (0 split)
##       alcohol              < 10.05    to the left,  agree=0.590, adj=0.049, (0 split)
##       sulphates            < 0.715    to the left,  agree=0.584, adj=0.037, (0 split)
##       residual.sugar       < 1.85     to the right, agree=0.581, adj=0.029, (0 split)
##
## Node number 3: 1277 observations,    complexity param=0.02890893
##   mean=6.424432, MSE=0.8378682
##   left son=6 (93 obs) right son=7 (1184 obs)
##   Primary splits:
##       free.sulfur.dioxide  < 11.5     to the left,  improve=0.08484051, (0 missing)
##       alcohol              < 11.85    to the left,  improve=0.06149941, (0 missing)
##       fixed.acidity        < 7.35     to the right, improve=0.04259695, (0 missing)
##       residual.sugar       < 1.275    to the left,  improve=0.02795662, (0 missing)
##       total.sulfur.dioxide < 67.5     to the left,  improve=0.02541719, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 48.5     to the left,  agree=0.937, adj=0.14, (0 split)
##
## Node number 4: 1406 observations,    complexity param=0.011086
##   mean=5.40256, MSE=0.526423
##   left son=8 (182 obs) right son=9 (1224 obs)
##   Primary splits:
##       volatile.acidity     < 0.4225   to the right, improve=0.04703189, (0 missing)
##       free.sulfur.dioxide  < 17.5     to the left,  improve=0.04607770, (0 missing)
##       total.sulfur.dioxide < 86.5     to the left,  improve=0.02894310, (0 missing)
##       alcohol              < 10.25    to the left,  improve=0.02890077, (0 missing)
##       chlorides            < 0.0455   to the right, improve=0.02096635, (0 missing)
##   Surrogate splits:
##       density       < 0.99107  to the left,  agree=0.874, adj=0.027, (0 split)
##       citric.acid   < 0.11     to the left,  agree=0.873, adj=0.022, (0 split)
##       fixed.acidity < 9.85     to the right, agree=0.873, adj=0.016, (0 split)
##       chlorides     < 0.206    to the right, agree=0.871, adj=0.005, (0 split)
##
## Node number 5: 1067 observations
##   mean=5.881912, MSE=0.591491
##
## Node number 6: 93 observations
##   mean=5.473118, MSE=1.066482
##
## Node number 7: 1184 observations,    complexity param=0.01655575
##   mean=6.499155, MSE=0.7432425
##   left son=14 (611 obs) right son=15 (573 obs)
##   Primary splits:
##       alcohol        < 11.85    to the left,  improve=0.05907511, (0 missing)
##       fixed.acidity  < 7.35     to the right, improve=0.04400660, (0 missing)
##       density        < 0.991395 to the right, improve=0.02522410, (0 missing)
##       residual.sugar < 1.225    to the left,  improve=0.02503936, (0 missing)
##       pH             < 3.245    to the left,  improve=0.02417936, (0 missing)
##   Surrogate splits:
##       density              < 0.991115 to the right, agree=0.710, adj=0.401, (0 split)
##       volatile.acidity     < 0.2675   to the left,  agree=0.665, adj=0.307, (0 split)
##       chlorides            < 0.0365   to the right, agree=0.631, adj=0.237, (0 split)
##       total.sulfur.dioxide < 126.5    to the right, agree=0.566, adj=0.103, (0 split)
##       residual.sugar       < 1.525    to the left,  agree=0.560, adj=0.091, (0 split)
##
## Node number 8: 182 observations
##   mean=4.994505, MSE=0.5109588
##
## Node number 9: 1224 observations
##   mean=5.463235, MSE=0.5002823
##
## Node number 14: 611 observations
##   mean=6.296236, MSE=0.7322117
##
## Node number 15: 573 observations
##   mean=6.715532, MSE=0.6642788
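
One way to read the CP table at the top of this output: each CP value is the drop in relative error gained by the next split. The short check below re-enters the `CP` and `rel error` columns by hand from the printed output (nothing is re-fitted) and confirms the relationship for the first five rows. The last CP of 0.01 is left out because it appears to be rpart's default `cp` stopping threshold rather than a computed drop.

```r
# Values copied by hand from the summary(wwrpart) output above
cp        <- c(0.17816211, 0.04439109, 0.02890893, 0.01655575, 0.01108600)
rel_error <- c(1.0000000, 0.8218379, 0.7774468, 0.7485379, 0.7319821, 0.7208961)

# Drop in relative error from one row of the table to the next
drops <- -diff(rel_error)

# Show both columns side by side and check that they agree
cp_check <- data.frame(cp = round(cp, 6), drop = round(drops, 6))
cp_check
cp_matches <- all(abs(cp - drops) < 1e-6)
cp_matches
```

The first CP value (0.1781621) also matches the `improve` value reported for the root split on alcohol.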

# Visualize the tree
rpart.plot(wwrpart, digits = 3, cex = 0.9, type = 3)

[Figure: regression tree for wine quality drawn by rpart.plot]


rpart.plot(wwrpart, digits = 4, fallen.leaves = TRUE,
type = 3, extra = 101)

[Figure: the same tree drawn with fallen leaves and per-node counts and percentages]
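
The plotted tree is equivalent to a handful of nested rules. As a rough sketch, the hypothetical helper `predict_by_hand()` below hard-codes the splits and leaf means transcribed from the `summary(wwrpart)` output above. It is only meant to show how a wine is routed to a leaf: it ignores missing values and surrogate splits, and `predict()` remains the right tool in practice.

```r
# Hypothetical helper: splits and leaf means transcribed from summary(wwrpart)
predict_by_hand <- function(alcohol, volatile.acidity, free.sulfur.dioxide) {
  if (alcohol < 10.85) {
    if (volatile.acidity >= 0.2425) {
      # node 8 (very high volatile acidity) or node 9
      if (volatile.acidity >= 0.4225) 4.994505 else 5.463235
    } else {
      5.881912  # node 5
    }
  } else {
    if (free.sulfur.dioxide < 11.5) {
      5.473118  # node 6
    } else if (alcohol < 11.85) {
      6.296236  # node 14
    } else {
      6.715532  # node 15
    }
  }
}

# First wine shown in str(ww): alcohol 8.8, volatile acidity 0.27, free SO2 45
first_wine <- predict_by_hand(8.8, 0.27, 45)
first_wine
```

The first wine falls into node 9, so the tree would predict a quality of 5.463235 for it. Note that the tree has only six leaves, so it can produce only six distinct predictions.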


wwr <- predict(wwrpart, wwtest)
# Evaluate the predictions on the test set
# 1.
# The correlation between the predicted and actual quality values
# provides a simple way to gauge the model's performance.
# (If we treated this as a classification problem, we could use a
# confusion matrix instead.)
cor(wwr, wwtest$quality)

## [1] 0.4931608

#2.
#Another way to think about the model's performance is to consider
#how far, on average, its prediction was from the true value. This
#measurement is called the mean absolute error (MAE).
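wwmae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}
# Pass the arguments in the order the function names them
# (the result is the same either way, since the error is absolute)
wwmae(wwtest$quality, wwr)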

## [1] 0.5732104
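
An MAE figure is easier to judge next to a naive baseline. The sketch below uses small made-up vectors (not the wine data) to compare a model's MAE with that of always predicting the training mean. `mae()` is just a renamed copy of the function above; with the real data one would substitute `wwtrain$quality`, `wwtest$quality`, and `wwr`.

```r
# Mean absolute error between actual and predicted values
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Made-up numbers for illustration only
train_quality <- c(5, 6, 6, 7, 6)
test_quality  <- c(5, 6, 7, 8)
model_pred    <- c(5.5, 6.0, 6.5, 7.0)

# Baseline: predict the training mean for every test wine
baseline_pred <- rep(mean(train_quality), length(test_quality))

mae_model    <- mae(test_quality, model_pred)
mae_baseline <- mae(test_quality, baseline_pred)
c(model = mae_model, baseline = mae_baseline)
```

If the model's MAE is not clearly below the baseline's, the tree has learned little beyond the average rating.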

The M5P function from the RWeka package can build model trees. If you are on a Mac, install the JDK, rJava, and RWekajars first, then install RWeka. RWeka has a lot of problems in R on OS X, so we turned to the Cubist package instead.


# We can use Cubist to build the model tree
library(Cubist)
# cubist() takes the predictors (x) and the outcome (y) separately,
# so no data argument is needed; column 12 is quality
winecubist <- cubist(x = wwtrain[, -12], y = wwtrain[, 12], committees = 5)
wine_cubist_predict <- predict(winecubist, newdata = wwtest[, -12])
cor(wine_cubist_predict, wwtest$quality)
wwmae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}
wwmae(wwtest$quality, wine_cubist_predict)

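Correlation between predicted and actual quality (this is the output of cor(), not R^2):

## [1] 0.5598169

MAE:

## [1] 0.5336499

Compared with the regression tree (correlation 0.4931608, MAE 0.5732104), the Cubist model does better on both measures.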
