Examples and Summary of Non-linear Regression in R, with IMDB Movie Data

As digital production of information becomes increasingly cheap and easy, people are offered with more and more options for consuming those digital productions in a limited time. Rating (score) has thus become an essential ingredient to help people make those choices. Predicting rating score in turn becomes a lucrative business for effective marketing.

Use the Internet Movie Database (abbreviated IMDB), an online database of information related to films, television programs and video games, to predict rating scores of those productions.

Here is the dataset you will be using.

(1) It is a subset of the IMDB database, and contains 28 variables for 5043 movies and 4906 posters spanning across 100 years in 66 countries. There are 2399 unique director names, and
thousands of actors/actresses. (https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset)

 


movie<-read.csv("movie_metadata.csv",header=T,sep=",")
str(movie)

## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 927 801 2027 377 603 106 2030 1652 1228 551 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1407 2218 2488 534 2432 2549 1227 801 2439 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 302 979 353 1968 526 440 785 221 336 32 ...
##  $ movie_title              : Factor w/ 4917 levels "[Rec] ","[Rec] 2 ",..: 398 2731 3279 3708 3332 1961 3291 3459 399 1631 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1392 3134 1769 1 2714 1969 2162 3018 2941 ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

library(neuralnet)

##
## Attaching package: 'neuralnet'

## The following object is masked from 'package:dplyr':
##
##     compute

library(nnet)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(pROC)

## Type 'citation("pROC")' for a citation.

##
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
##
##     cov, smooth, var

library(Cubist)
library(rpart)
library(glmnet)

## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-5

##
## Attaching package: 'glmnet'

## The following object is masked from 'package:pROC':
##
##     auc

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

##
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
##
##     margin

## The following object is masked from 'package:dplyr':
##
##     combine

library(kernlab)

##
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
##
##     alpha

##to make life easy we will focus on non-text variables
colnames(movie)

##  [1] "color"                     "director_name"
##  [3] "num_critic_for_reviews"    "duration"
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"
##  [7] "actor_2_name"              "actor_1_facebook_likes"
##  [9] "gross"                     "genres"
## [11] "actor_1_name"              "movie_title"
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"
## [17] "plot_keywords"             "movie_imdb_link"
## [19] "num_user_for_reviews"      "language"
## [21] "country"                   "content_rating"
## [23] "budget"                    "title_year"
## [25] "actor_2_facebook_likes"    "imdb_score"
## [27] "aspect_ratio"              "movie_facebook_likes"

lapply(movie,class)

## $color
## [1] "factor"
##
## $director_name
## [1] "factor"
##
## $num_critic_for_reviews
## [1] "integer"
##
## $duration
## [1] "integer"
##
## $director_facebook_likes
## [1] "integer"
##
## $actor_3_facebook_likes
## [1] "integer"
##
## $actor_2_name
## [1] "factor"
##
## $actor_1_facebook_likes
## [1] "integer"
##
## $gross
## [1] "integer"
##
## $genres
## [1] "factor"
##
## $actor_1_name
## [1] "factor"
##
## $movie_title
## [1] "factor"
##
## $num_voted_users
## [1] "integer"
##
## $cast_total_facebook_likes
## [1] "integer"
##
## $actor_3_name
## [1] "factor"
##
## $facenumber_in_poster
## [1] "integer"
##
## $plot_keywords
## [1] "factor"
##
## $movie_imdb_link
## [1] "factor"
##
## $num_user_for_reviews
## [1] "integer"
##
## $language
## [1] "factor"
##
## $country
## [1] "factor"
##
## $content_rating
## [1] "factor"
##
## $budget
## [1] "numeric"
##
## $title_year
## [1] "integer"
##
## $actor_2_facebook_likes
## [1] "integer"
##
## $imdb_score
## [1] "numeric"
##
## $aspect_ratio
## [1] "numeric"
##
## $movie_facebook_likes
## [1] "integer"

column<-c(2,7,10,11,12,15,17,18,20,21,22)
movie_n<-movie[,-column]
lapply(movie_n,class)

## $color
## [1] "factor"
##
## $num_critic_for_reviews
## [1] "integer"
##
## $duration
## [1] "integer"
##
## $director_facebook_likes
## [1] "integer"
##
## $actor_3_facebook_likes
## [1] "integer"
##
## $actor_1_facebook_likes
## [1] "integer"
##
## $gross
## [1] "integer"
##
## $num_voted_users
## [1] "integer"
##
## $cast_total_facebook_likes
## [1] "integer"
##
## $facenumber_in_poster
## [1] "integer"
##
## $num_user_for_reviews
## [1] "integer"
##
## $budget
## [1] "numeric"
##
## $title_year
## [1] "integer"
##
## $actor_2_facebook_likes
## [1] "integer"
##
## $imdb_score
## [1] "numeric"
##
## $aspect_ratio
## [1] "numeric"
##
## $movie_facebook_likes
## [1] "integer"

movie_n$color<-as.numeric(movie_n$color)

sum(is.na(movie_n)) # so many NAs and we cannot just delete them all...

## [1] 2059

#Use Amelia here to impute data sets
library(Amelia)

## Loading required package: Rcpp

## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

missmap(movie_n)

movie_n_c<-amelia(movie_n,p2s=0)
write.amelia(obj = movie_n_c, file.stem = "movie_n_c")
movie_n_c_m<-read.csv("movie_n_c1.csv",header=T,sep=",")
sum(is.na(movie_n_c_m))

## [1] 0

str(movie_n_c_m)

## 'data.frame':    5043 obs. of  18 variables:
##  $ X                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ color                    : int  3 3 3 3 1 3 3 3 3 3 ...
##  $ num_critic_for_reviews   : num  723 302 602 813 41.5 ...
##  $ duration                 : num  178 169 148 164 72.8 ...
##  $ director_facebook_likes  : num  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : num  855 1000 161 23000 244 ...
##  $ actor_1_facebook_likes   : num  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : num  760505847 309404152 200074175 448130642 -22159440 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ facenumber_in_poster     : num  0 0 1 0 0 1 0 1 4 3 ...
##  $ num_user_for_reviews     : num  3054 1238 994 2701 -134 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.59e+07 ...
##  $ title_year               : num  2009 2007 2015 2012 1996 ...
##  $ actor_2_facebook_likes   : num  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 2.79 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

movie_data<-movie_n_c_m[,-1]
order<-sample(5043,4000)
movie_train<-movie_data[order,]
movie_test<-movie_data[-order,]
Movie_rating_actual<-movie_test[,15]

#########################################
#linear regression-lasso-
set.seed(2017)
Movie_lasso_seeker<-cv.glmnet(as.matrix
(movie_train[,-15]),as.matrix(movie_train[,15]))
plot(Movie_lasso_seeker)

Movie_lasso_seeker$lambda.1se#0.0346119

## [1] 0.03495735

Movie_lasso<-glmnet(as.matrix(movie_train[,-15]),
as.matrix(movie_train[,15]),alpha=1,lambda = 0.0346119)
# With lambda of 0.0346119
coef(Movie_lasso)

## 17 x 1 sparse Matrix of class "dgCMatrix"
##                                      s0
## (Intercept)                4.044668e+01
## color                     -1.643380e-01
## num_critic_for_reviews     1.485058e-03
## duration                   3.920846e-03
## director_facebook_likes    .
## actor_3_facebook_likes    -5.650267e-06
## actor_1_facebook_likes     .
## gross                     -1.982404e-09
## num_voted_users            2.726167e-06
## cast_total_facebook_likes  .
## facenumber_in_poster      -1.474479e-03
## num_user_for_reviews      -1.208860e-05
## budget                     .
## title_year                -1.720517e-02
## actor_2_facebook_likes     .
## aspect_ratio               7.295441e-02
## movie_facebook_likes       .

Movie_lasso_predict<-predict(Movie_lasso,
newx=as.matrix(movie_test[,-15]),s=0.0346119)
Movie_lasso_RMSE<-sqrt(mean((Movie_lasso_predict - Movie_rating_actual)^2))
print(paste0("RMSE of Lasson Regression: ",Movie_lasso_RMSE))

## [1] "RMSE of Lasson Regression: 0.908081692433048"

print(paste0("R^2 of Lasson Regression: ",
cor(Movie_lasso_predict, Movie_rating_actual)^2))

## [1] "R^2 of Lasson Regression: 0.29085976878676"

#########################################
#non-linear Rpart(cart)
Movie_rpart<-rpart(movie_train[,15]~.,data=movie_train)
Movie_rpart_predict<-predict(Movie_rpart,newdata=movie_test)
Movie_rpart_RMSE<-sqrt(mean((Movie_rpart_predict - Movie_rating_actual)^2))
print(paste0("RMSE of Regression Tree: ",Movie_rpart_RMSE))

## [1] "RMSE of Regression Tree: 0.201212546198223"

print(paste0("R^2 of Regression Tree: ",
cor(Movie_rpart_predict, Movie_rating_actual)^2))

## [1] "R^2 of Regression Tree: 0.964806832013899"

#########################################
#non-linear Cubist(model tree-a rule learner)
Movie_cubist<-cubist(x=movie_train[,-15],y=movie_train[,15],committees = 20)
Movie_cubist_predict<-predict(Movie_cubist,newdata=movie_test)
Movie_cubist_RMSE<-sqrt(mean((Movie_cubist_predict - Movie_rating_actual)^2))
print(paste0("RMSE of Model Tree: ",Movie_cubist_RMSE))

## [1] "RMSE of Model Tree: 0.779054282330458"

print(paste0("R^2 of Model Tree: ",
cor(Movie_cubist_predict, Movie_rating_actual)^2))

## [1] "R^2 of Model Tree: 0.473986734764114"

#########################################
#non-linear Neural Network
numFolds <- trainControl(method = 'cv', number = 10, classProbs = F,
verboseIter = TRUE,
preProcOptions = list(thresh = 0.75, ICAcomp = 3, k = 5))
Movie_nn<- train(x=movie_train[,-15],y=movie_train[,15], method = "nnet", preProcess =
c('center', 'scale'), trControl = numFolds,
tuneGrid=expand.grid(size=c(10), decay=c(0.1)))

## + Fold01: size=10, decay=0.1
## # weights:  181
## initial  value 124706.189125
## iter  10 value 111162.750519
## iter  20 value 111134.763028
## final  value 111134.723223
## converged
## - Fold01: size=10, decay=0.1
## + Fold02: size=10, decay=0.1
## # weights:  181
## initial  value 119863.214694
## iter  10 value 111300.578484
## iter  20 value 111147.376809
## final  value 111147.293741
## converged
## - Fold02: size=10, decay=0.1
## + Fold03: size=10, decay=0.1
## # weights:  181
## initial  value 127435.723537
## iter  10 value 111246.310730
## iter  20 value 111140.978208
## iter  30 value 111137.839585
## final  value 111137.825160
## converged
## - Fold03: size=10, decay=0.1
## + Fold04: size=10, decay=0.1
## # weights:  181
## initial  value 140347.048749
## iter  10 value 111240.971502
## iter  20 value 111100.373428
## final  value 111100.363499
## converged
## - Fold04: size=10, decay=0.1
## + Fold05: size=10, decay=0.1
## # weights:  181
## initial  value 133684.915738
## iter  10 value 111159.032827
## iter  20 value 111046.384025
## iter  30 value 111033.457316
## final  value 111033.413553
## converged
## - Fold05: size=10, decay=0.1
## + Fold06: size=10, decay=0.1
## # weights:  181
## initial  value 121177.447532
## iter  10 value 111144.612892
## iter  20 value 110977.823782
## final  value 110977.793205
## converged
## - Fold06: size=10, decay=0.1
## + Fold07: size=10, decay=0.1
## # weights:  181
## initial  value 120550.745685
## iter  10 value 111232.960034
## iter  20 value 111071.770700
## iter  30 value 111071.644014
## iter  30 value 111071.643352
## iter  30 value 111071.643345
## final  value 111071.643345
## converged
## - Fold07: size=10, decay=0.1
## + Fold08: size=10, decay=0.1
## # weights:  181
## initial  value 125041.312725
## iter  10 value 111176.602647
## iter  20 value 111027.622211
## final  value 111027.573812
## converged
## - Fold08: size=10, decay=0.1
## + Fold09: size=10, decay=0.1
## # weights:  181
## initial  value 132089.817688
## iter  10 value 111451.346547
## iter  20 value 111087.662178
## iter  30 value 111083.571055
## final  value 111083.553540
## converged
## - Fold09: size=10, decay=0.1
## + Fold10: size=10, decay=0.1
## # weights:  181
## initial  value 125482.037103
## iter  10 value 111340.477187
## iter  20 value 111172.586695
## final  value 111172.563413
## converged
## - Fold10: size=10, decay=0.1
## Aggregating results
## Fitting final model on full training set
## # weights:  181
## initial  value 160357.278201
## iter  10 value 123581.526303
## iter  20 value 123431.778791
## final  value 123431.538367
## converged

Movie_nn_predict<-predict(Movie_nn,newdata=movie_test,type="raw")
Movie_nn_RMSE<-sqrt(mean((Movie_nn_predict - Movie_rating_actual)^2))
print(paste0("RMSE of nn: ",Movie_nn_RMSE))

## [1] "RMSE of nn: 5.56598631378482"

print(paste0("R^2 of nn: ",
cor(Movie_nn_predict, Movie_rating_actual)^2))

## [1] "R^2 of nn: 0.189664339777952"

#########################################
#non-linear KNN regression
Movie_knnreg<-knnreg(movie_train[,-15],y=movie_train[,15],k=5)
Movie_knnreg_predict<-predict(Movie_knnreg,newdata=movie_test[-15])
Movie_knnreg_RMSE<-sqrt(mean((Movie_knnreg_predict - Movie_rating_actual)^2))
print(paste0("RMSE of KNN regression: ",Movie_knnreg_RMSE))

## [1] "RMSE of KNN regression: 1.1384027652733"

print(paste0("R^2 of KNN regression: ",
cor(Movie_knnreg_predict, Movie_rating_actual)^2))

## [1] "R^2 of KNN regression: 0.0208118811483293"

#########################################
#non-linear Ksvm
Movie_ksvm<-ksvm(movie_train[,15]~.,data=movie_train,scale=T)
Movie_ksvm_predict<-predict(Movie_ksvm,newdata=movie_test)
Movie_ksvm_RMSE<-sqrt(mean((Movie_ksvm_predict - Movie_rating_actual)^2))
print(paste0("RMSE of Ksvm: ",Movie_ksvm_RMSE))

## [1] "RMSE of Ksvm: 0.233379891444411"

print(paste0("R^2 of Ksvm ",
cor(Movie_ksvm_predict, Movie_rating_actual)^2))

## [1] "R^2 of Ksvm 0.953082484445953"

The best algorithms to do the job is Ksvm and Rpart.

 

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s