Pipeline Steps in Python

We use datasets from the scikit-learn library (the face datasets need to be downloaded separately), and the editor is Sublime Text 3. Most of the code comes from the book Introduction to Machine Learning with Python: https://www.goodreads.com/book/show/32439431-introduction-to-machine-learning-with-python?from_search=true

###################always keep the below code#####################
import os
import sys
sys.path.append('//anaconda/lib/python3.6/site-packages')
###################always keep the above code#####################

###We can use the Pipeline class to express the workflow for training an SVM after scaling the data
###with MinMaxScaler (for now without the grid search). First, we build a pipeline object by providing
###it with a list of steps. Each step is a tuple containing a name and an instance of an estimator.
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

###1. Build a simple pipeline to train the model
from sklearn.pipeline import Pipeline
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
###The pipeline object prints as follows (defaults as shown by the sklearn version used here):
#Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
#      ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#      decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
#      max_iter=-1, probability=False, random_state=None, shrinking=True,
#      tol=0.001, verbose=False))])
pipe.fit(X_train, y_train)
print(pipe.score(X_test,y_test))
#0.951048951049

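###Here we first define two steps inside Pipeline() (both with their default parameters) and then fit
###the whole pipeline. pipe.fit fits the scaler on the training data, transforms the training data,
###and fits the SVM on the scaled data. pipe.score scales the test data with the already fitted
###scaler and then calls score on the SVM.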
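###To make the fit/score sequence concrete, here is a minimal sketch that performs the same steps
###by hand and compares the result with the pipeline. The variable names (scaler, svm, manual_score,
###pipe_score) are only illustrative, and the actual score depends on the installed sklearn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Manual version: the scaler is fitted on the training data only
scaler = MinMaxScaler().fit(X_train)
svm = SVC().fit(scaler.transform(X_train), y_train)
# The test data is transformed with the scaler fitted on the training data
manual_score = svm.score(scaler.transform(X_test), y_test)

# Pipeline version: the same steps wrapped in a single estimator
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe_score = pipe.fit(X_train, y_train).score(X_test, y_test)

print("manual:  ", manual_score)
print("pipeline:", pipe_score)
```

###Both lines should print the same accuracy, since the pipeline only automates the manual steps.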

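###2. Use pipelines in grid search
###The main benefit of using the pipeline, however, is that we can now use this single estimator
###in cross_val_score or GridSearchCV. We can pass pipe directly to GridSearchCV. Each key in the
###parameter grid is the step name, a double underscore, and the parameter name (e.g. svm__C).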
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)
print("best cross-validation accuracy:", grid.best_score_)
print("test set score: ", grid.score(X_test, y_test))
print("best parameters: ", grid.best_params_)
#best cross-validation accuracy: 0.978873239437
#test set score:  0.972027972028
#best parameters:  {'svm__C': 1, 'svm__gamma': 1}
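
###The same pipeline also works with cross_val_score, mentioned above. The sketch below is only an
###illustration: it fixes C=1 and gamma=1 (the best parameters reported above) and lists the
###parameter names that the pipeline exposes for the "svm" step. Within each fold the scaler is
###refitted on the training part only, so the validation part never influences the scaling.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=1, gamma=1))])

# Parameters of a step are addressed as <step name>__<parameter name>
svm_params = sorted(p for p in pipe.get_params() if p.startswith("svm__"))
print(svm_params)

# 10-fold cross-validation; the pipeline is cloned and refitted for every fold
scores = cross_val_score(pipe, X_train, y_train, cv=10)
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())
```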

###3. We can simplify pipeline construction with make_pipeline, which names each step
###automatically after its class (in lowercase)
from sklearn.pipeline import make_pipeline
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

print(pipe_short.steps)
#[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
# ('svc', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
#    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
#    max_iter=-1, probability=False, random_state=None, shrinking=True,
#    tol=0.001, verbose=False))]
pipe_short.fit(X_train, y_train)
print(pipe_short.score(X_test,y_test))
#0.965034965035
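
###A short sketch of how make_pipeline chooses step names. The second pipeline (pipe_dup, with
###StandardScaler and PCA) is not part of the example above and is added only to show what happens
###when two steps share the same class: a numeric suffix is appended to keep the names unique.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Step names are the lowercased class names
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))
names = [name for name, _ in pipe_short.steps]
print(names)      # ['minmaxscaler', 'svc']

# Hypothetical pipeline with a repeated class, for illustration only
pipe_dup = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
dup_names = [name for name, _ in pipe_dup.steps]
print(dup_names)  # ['standardscaler-1', 'pca', 'standardscaler-2']
```

###With these automatic names, a grid-search key for pipe_short would be svc__C rather than svm__C.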
