Preprocess: Scaling in Python

We use data from the sklearn library, and the IDE is Sublime Text 3. Most of the code comes from the book Introduction to Machine Learning with Python: https://www.goodreads.com/book/show/32439431-introduction-to-machine-learning-with-python?from_search=true

###Preprocessing methods
###The StandardScaler in scikit-learn ensures that for each feature, the mean is zero, 
###and the variance is one, bringing all features to the same magnitude. However, 
###this scaling does not ensure any particular minimum and maximum values for the features.

###The RobustScaler works similarly to the StandardScaler in that it ensures statistical 
###properties for each feature that guarantee that they are on the same scale. However, 
###the RobustScaler uses the median and quartiles instead of mean and variance

###The MinMaxScaler on the other hand shifts the data such that all features are exactly 
###between 0 and 1.

###The Normalizer does a very different kind of rescaling. It scales each data point such 
###that the feature vector has a Euclidean length of one.
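The four scalers described above can be compared side by side on a tiny array. This is a minimal sketch (not from the book) that simply verifies each scaler's defining property: StandardScaler gives mean 0 and standard deviation 1 per feature, RobustScaler centers each feature on its median, MinMaxScaler maps each feature exactly onto [0, 1], and Normalizer gives each row a Euclidean length of one.

```python
import numpy as np
from sklearn.preprocessing import (StandardScaler, RobustScaler,
                                   MinMaxScaler, Normalizer)

X = np.array([[1., -2.,  2.],
              [3.,  0.,  0.],
              [0.,  1., -1.]])

# StandardScaler: per-feature mean 0, standard deviation 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))

# RobustScaler: per-feature median 0, scaled by the interquartile range
X_rob = RobustScaler().fit_transform(X)
print(np.median(X_rob, axis=0))

# MinMaxScaler: per-feature minimum 0 and maximum 1
X_mm = MinMaxScaler().fit_transform(X)
print(X_mm.min(axis=0), X_mm.max(axis=0))

# Normalizer: per-sample (row) Euclidean length 1
X_norm = Normalizer().fit_transform(X)
print(np.linalg.norm(X_norm, axis=1))
```

Note that the first three scalers operate per feature (per column), while the Normalizer operates per sample (per row), which is why it is a different kind of rescaling.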

from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_blobs
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split

###We will visualize the scaler effect here
X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)
X_train, X_test = train_test_split(X, random_state=5, test_size=.1)

fig, axes = plt.subplots(1, 2, figsize=(13, 4))
axes[0].scatter(X_train[:, 0], X_train[:, 1],
                c='b', label="training set", s=60)
axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^',
                c='r', label="test set", s=60)
axes[0].legend(loc='upper left')
axes[0].set_title("original data")

###scale the data here and visualize
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],
                c='b', label="training set", s=60)
axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^',
                c='r', label="test set", s=60)
axes[1].set_title("scaled data")

plt.show()

###figure_1.png: left, the original data; right, the MinMax-scaled data
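One detail in the code above deserves emphasis: the scaler is fit on the training set only, and the test set is transformed with the training-set minimum and maximum (so scaled test values are not guaranteed to land inside [0, 1]). As a sketch, the fit-then-transform pair on the training data can also be written in one step with fit_transform:

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)
X_train, X_test = train_test_split(X, random_state=5, test_size=.1)

scaler = MinMaxScaler()
# fit_transform fits the scaler and transforms the training data in one call
X_train_scaled = scaler.fit_transform(X_train)
# the test set reuses the training-set min/max; do NOT refit on the test set,
# or the two sets end up on incompatible scales
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Refitting the scaler on the test set would silently move the two sets onto different scales, which defeats the purpose of scaling them consistently.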

###How important is scaling?
from sklearn.svm import SVC
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
svm = SVC(C=100)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
###0.629370629371

scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
svm.fit(X_train_scaled, y_train)
print(svm.score(X_test_scaled, y_test))
###0.965034965035
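The scale-then-fit steps above can also be chained with a scikit-learn Pipeline, which ties the scaler and the model together so the scaler is always fit on exactly the data the model is trained on. This is a sketch of the same experiment, not part of the snippet above:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# scaler and SVM chained: fit() scales the training data, then trains the SVM;
# score() applies the same (already-fit) scaling to the test data
pipe = make_pipeline(MinMaxScaler(), SVC(C=100))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # should match the manually scaled SVM score above
```

A Pipeline is also the safe way to combine scaling with cross-validation or grid search, since the scaler is refit inside each training fold rather than on the full dataset.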
