Quick Cross Validation and Grid Search of Parameters in Python

Cross-validation is a way to reduce overfitting when training a model. We have also applied the grid search method in both Python and R:
https://charleshsliao.wordpress.com/2017/05/20/logistic-regression-in-python-to-tune-parameter-c/
https://charleshsliao.wordpress.com/2017/04/24/cnndnn-of-keras-in-r-backend-tensorflow-for-mnist/

We will focus on how to use both methods to identify the best parameters, model, and score without overfitting.

We use data from the sklearn library, and the IDE is Sublime Text 3. Most of the code comes from the book: https://www.goodreads.com/book/show/32439431-introduction-to-machine-learning-with-python?from_search=true

import numpy as np
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
###cross-validation is a more robust way to assess generalization performance than a single 
###split of the data into a training and a test set. We will also discuss methods to evaluate 
###classification and regression performance that go beyond the default measures of accuracy and 
###R^2 provided by the score method. 

###Grid search is an effective method for adjusting the parameters in supervised models 
###for the best generalization performance

###1.1 CV in sklearn
###The parameters of the cross_val_score function are the model we want to evaluate,
###the training data, and the ground-truth labels. By default, cross_val_score
###performs k-fold cross-validation (three folds in older scikit-learn releases, five in newer ones),
###returning one accuracy value per fold; here we request ten folds with cv=10.
###cross_val_score is the core function for cross validation 
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
iris=load_iris()
logreg=LogisticRegression()

scores=cross_val_score(logreg,iris.data,iris.target,cv=10)
print("cross-validation scores:",scores)
#[ 1.          1.          1.          0.93333333  0.93333333  0.93333333
# 0.8         0.93333333  1.          1.        ]
print("mean of cross-validation scores:",scores.mean())
# mean of cross-validation scores: 0.953333333333
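
###the comments above mention evaluating beyond plain accuracy and R^2; as a minimal sketch
###(not in the original post), cross_val_score also accepts a scoring parameter, e.g. macro-averaged F1:
print(cross_val_score(logreg, iris.data, iris.target, cv=10, scoring='f1_macro'))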

###1.2 K-Fold CV; shuffle the data to remove the ordering of the samples by label
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
print(cross_val_score(logreg,iris.data,iris.target,cv=kfold))
###pass the KFold object as cv; the scores may differ from the run above because
###shuffled (non-stratified) folds are used instead of the default stratified folds

###For most use cases, the default of k-fold cross-validation for regression and
###stratified k-fold for classification works well, but there are some cases where
###we might want to use a different strategy (a short sketch of passing a splitter explicitly follows below).
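###as a minimal sketch (not in the original post), a splitter object such as StratifiedKFold
###can be passed to cv directly, reusing the logreg and iris objects defined above:
from sklearn.model_selection import StratifiedKFold
###shuffle within each class before splitting; random_state keeps the folds reproducible
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(logreg, iris.data, iris.target, cv=skfold))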

###1.3 Leave one out CV
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
print("number of cv iterations: ", len(scores))
print("mean accuracy: ", scores.mean())
#number of cv iterations:  150
#mean accuracy:  0.953333333333

###1.4 Shuffle-split CV
###Each split samples train_size many points for the training set and test_size many (disjoint)
###points for the test set. This splitting is repeated n_splits times.
from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(n_splits=10,test_size=.5, train_size=.5)
print(cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split))
#[ 0.93333333  0.85333333  0.94666667  0.96        0.93333333  0.97333333
#  0.96        0.97333333  0.93333333  0.86666667]
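
###a stratified variant, StratifiedShuffleSplit, also exists; a minimal sketch (not in the original post),
###reusing logreg and iris and fixing random_state for reproducibility:
from sklearn.model_selection import StratifiedShuffleSplit
stratified_shuffle = StratifiedShuffleSplit(n_splits=10, test_size=.5, train_size=.5, random_state=0)
print(cross_val_score(logreg, iris.data, iris.target, cv=stratified_shuffle))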

###1.5 CV for different groups.
###Sometimes we want to evaluate a model on data where groups of samples share an attribute,
###like the same emotional faces collected from different people; we can use GroupKFold
###(called LabelKFold in older scikit-learn versions)

from sklearn.model_selection import GroupKFold
from sklearn.datasets import make_blobs
X,y=make_blobs(n_samples=12, random_state=1)
####Assume the first three samples belong to the same group, then the next four, etc.
groups=[0,0,0,1,1,1,1,2,2,3,3,3]
print(cross_val_score(logreg,X,y,groups=groups,cv=GroupKFold(n_splits=3)))
#[ 0.75  0.6   1.  ]

###2.1 Simple Grid Search
###Finding the values of the important parameters of a model (the ones that provide the best 
###generalization performance) is a tricky task, but necessary for almost all models and datasets.
###Naive grid search implementation
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
print("Size of training set: %d   size of test set: %d" % (X_train.shape[0], X_test.shape[0]))
#Size of training set: 112   size of test set: 38

best_score=0
gammarange=[0.001, 0.01, 0.1, 1, 10, 100]
crange=[0.001, 0.01, 0.1, 1, 10, 100]
for gamma in gammarange:
	for c in crange:
		svm=SVC(gamma=gamma,C=c)
		svm.fit(X_train,y_train)
		score=svm.score(X_test,y_test)
		if score>best_score:
			best_score=score
			best_parameters={'C':c,'gamma':gamma}
print("best score: ",best_score)
print("best parameters: ", best_parameters)
#best score:  0.973684210526
#best parameters:  {'C': 100, 'gamma': 0.001}

###As we can see here, the final result is promising. However, since we tuned the parameters on the test set, we still need to consider the risk of overfitting to it.
###2.2 The grid search with CV
X_trainval, X_test, y_trainval, y_test=train_test_split(iris.data, iris.target, random_state=0)
gammarange=[0.001, 0.01, 0.1, 1, 10, 100]
crange=[0.001, 0.01, 0.1, 1, 10, 100]
best_score=0
for gamma in gammarange:
	for c in crange:
		svm=SVC(gamma=gamma,C=c)
		###evaluate each parameter combination with 10-fold CV on the training part only
		scores=cross_val_score(svm,X_trainval,y_trainval,cv=10)
		score=np.mean(scores)
		if score>best_score:
			best_score=score
			best_parameters={'C':c,'gamma':gamma}
svm_best=SVC(**best_parameters)
print(svm_best.fit(X_trainval,y_trainval)) ###print out the parameters 
#SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
#  decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
#  max_iter=-1, probability=False, random_state=None, shrinking=True,
#  tol=0.001, verbose=False)

print(svm_best.score(X_test,y_test))
###0.973684210526

###2.3 GridSearchCV 
###To use the GridSearchCV class, you first need to specify the parameters you 
###want to search over using a dictionary.
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
from sklearn.model_selection import GridSearchCV
grid_search=GridSearchCV(SVC(),param_grid,cv=10)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
###The grid_search object that we created behaves just like a classifier; we can call the 
###standard methods fit, predict and score on it 

print(grid_search.fit(X_train,y_train))
#GridSearchCV(cv=10, error_score='raise',
#       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
#  max_iter=-1, probability=False, random_state=None, shrinking=True,
#  tol=0.001, verbose=False),
#       fit_params={}, iid=True, n_jobs=1,
#       param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
#       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
#       scoring=None, verbose=0)
print(grid_search.score(X_test,y_test))
###0.973684210526 
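###the comment above notes that predict also works on the grid_search object;
###a minimal sketch (not in the original post) on the first five test samples:
print(grid_search.predict(X_test[:5]))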
print(grid_search.best_params_)
print(grid_search.best_score_)
#{'C': 100, 'gamma': 0.01}
#0.982142857143
###Using the score method (or evaluating the output of the predict method)
###employs a model retrained on the whole training set. The best_score_ attribute stores
###the mean cross-validation accuracy obtained on the training set
###(an average over the 10 folds).

###print out the actual model
print(grid_search.best_estimator_)
#SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
#  decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
#  max_iter=-1, probability=False, random_state=None, shrinking=True,
#  tol=0.001, verbose=False)

###print out each combination and the result
###(grid_scores_ was removed in newer scikit-learn versions; cv_results_ is the replacement)
print(grid_search.cv_results_)
###SKIPPED
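
###a minimal sketch (not in the original post) of viewing the results as a table, assuming pandas is available:
import pandas as pd
results = pd.DataFrame(grid_search.cv_results_)
###mean_test_score is the mean cross-validation accuracy for each of the 36 parameter combinations
print(results[['param_C', 'param_gamma', 'mean_test_score']].head())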

###3.1 Nested CV
###Instead of splitting the original data into a training and a test set once, as we did in grid search,
###we use multiple splits of cross-validation, nested within each other. The result of nested
###cross-validation is a list of scores, not a single model.

###Because it does not provide a model that can be used on new data, nested cross-validation is rarely used
###when looking for a predictive model to apply to future data.
###It is useful, though, for evaluating how well a given model works on a particular dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
iris=load_iris()
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
scores = cross_val_score(GridSearchCV(SVC(), param_grid, cv=5), iris.data, iris.target, cv=5)
print("Cross-validation scores: ", scores)
print("Mean cross-validation score: ", scores.mean())
#Cross-validation scores:  [ 0.96666667  1.          0.96666667  0.96666667  1.        ]
#Mean cross-validation score:  0.98

###3.2 We build a loop with both inner and outer splits and CV to implement nested CV manually
def nested_cv(X, y, inner_cv, outer_cv, Classifier, parameter_grid):
    outer_scores = []
    # for each split of the data in the outer cross-validation
    # (the split method returns indices)
    for training_samples, test_samples in outer_cv.split(X, y):
        # find the best parameters using the inner cross-validation
        best_params = {}
        best_score = -np.inf
        # iterate over parameter combinations
        for parameters in parameter_grid:
            # accumulate scores over the inner splits
            cv_scores = []
            # iterate over inner cross-validation folds of the outer training set
            for inner_train, inner_test in inner_cv.split(X[training_samples], y[training_samples]):
                # build a classifier with the given parameters on the inner training data
                # (inner indices are relative to the outer training subset)
                clf = Classifier(**parameters)
                clf.fit(X[training_samples][inner_train], y[training_samples][inner_train])
                # evaluate on the inner test set
                score = clf.score(X[training_samples][inner_test], y[training_samples][inner_test])
                cv_scores.append(score)
            # compute the mean score over the inner folds
            mean_score = np.mean(cv_scores)
            if mean_score > best_score:
                # if better than so far, remember these parameters
                best_score = mean_score
                best_params = parameters
        # build a classifier with the best parameters on the whole outer training set
        clf = Classifier(**best_params)
        clf.fit(X[training_samples], y[training_samples])
        # evaluate on the outer test set
        outer_scores.append(clf.score(X[test_samples], y[test_samples]))
    return outer_scores
from sklearn.model_selection import ParameterGrid, StratifiedKFold
print(nested_cv(iris.data, iris.target, StratifiedKFold(5), StratifiedKFold(10), SVC, ParameterGrid(param_grid)))
###one accuracy value per outer fold (10 values for StratifiedKFold(10)); exact values depend on the fold assignment