Preprocess: PCA Application in Python

We use datasets that ship with the scikit-learn library, and the editor is Sublime Text 3. Most of the code comes from the book Introduction to Machine Learning with Python: https://www.goodreads.com/book/show/32439431-introduction-to-machine-learning-with-python?from_search=true

###Sometimes the features (variables) in a dataset are not independent of each other.
###One option is to inspect the pairwise correlations before any other preprocessing and simply
###exclude features that are highly correlated with others. This approach does not scale to a
###large number of variables, and dropping features discards part of the variability in the data.
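
###A minimal sketch of the correlation check described above, using the breast cancer data
###that appears later in this post. The 0.95 cut-off and the names corr, upper, threshold and
###high_pairs are illustrative assumptions, not taken from the book.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# absolute pairwise Pearson correlations between the 30 features
corr = df.corr().abs()

# keep only the upper triangle so that each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # illustrative cut-off (an assumption)
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold].sort_values(ascending=False)

# feature pairs that are candidates for exclusion
print(high_pairs)
```

###Even on 30 features the list of candidate pairs has to be reviewed by hand, which is
###why this approach becomes impractical as the number of variables grows.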

###An autoencoder is another way to reduce dimensionality.
###Here we cover a more basic method, Principal Component Analysis (PCA), a widely used algorithm
###for data transformation and preprocessing.

from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

###1. Before we apply PCA, we scale our data so that each feature has unit variance using StandardScaler
import pandas as pd
cancer = load_breast_cancer()
cancerdata = pd.DataFrame(cancer.data)

def rstr(df):
    # return the shape of the frame and the unique values of each column
    return df.shape, df.apply(lambda x: [x.unique()])

print('\nstructure of data:\n', rstr(cancerdata))
##The output shows that the cancer data has 30 features (columns).

scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)
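
###To make the scaling step concrete, here is a small check that StandardScaler is equivalent
###to subtracting each column's mean and dividing by its standard deviation. The name X_manual
###is only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# manual version: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0, NumPy's default)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# True if both routes give the same result
print(np.allclose(X_scaled, X_manual))

# after scaling, every feature has mean ~0 and standard deviation ~1
print(np.abs(X_scaled.mean(axis=0)).max())
print(X_scaled.std(axis=0)[:3])
```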

###2. Let us keep the first 2 principal components of the data
pca=PCA(n_components=2)
## fit the PCA model to the breast cancer data
pca.fit(X_scaled)
## project the data onto the first two principal components
X_pca=pca.transform(X_scaled)
print("Original shape: %s" % str(X_scaled.shape))
###Original shape: (569, 30)

print("Reduced shape: %s" % str(X_pca.shape))
###Reduced shape: (569, 2)
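
###Going from 30 columns to 2 loses information. One way to see how much is kept is the
###explained_variance_ratio_ attribute of a fitted PCA. The sketch below fits all 30 components
###on the scaled data; the 95% target and the names pca_full, cumulative and n_95 are
###illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# keep all components so the full variance spectrum is visible
pca_full = PCA().fit(X_scaled)
ratio = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratio)

print("variance explained by the first two components: %.3f" % cumulative[1])

# smallest number of components whose cumulative share reaches 95%
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95%% of the variance: %d" % n_95)
```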

###3. plot the first two components
plt.figure(figsize=(8, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cancer.target, s=60)
plt.gca().set_aspect("equal")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()



###4. Another application of PCA is feature extraction, especially in image recognition
##Images are usually stored as red, green and blue intensities for each pixel. 
##But images are made up of many pixels, and only together are they meaningful; 
##objects in images are usually made up of thousands of pixels.

###4.1 load the face images (Labeled Faces in the Wild) and plot the first ten
from sklearn.datasets import fetch_lfw_people
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
image_shape = people.images[0].shape
fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
plt.suptitle("some_faces")
plt.show()


##Keep at most 50 images per person so that people with many photos do not dominate,
##and scale the pixel values by dividing by 255
mask = np.zeros(people.target.shape, dtype=bool)  # np.bool was removed in NumPy 1.24
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask]/255
y_people = people.target[mask]

###4.2 we build a baseline model with a one-nearest-neighbor (kNN) classifier on the raw pixels
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(
        X_people, y_people, stratify=y_people, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
###0.232558139535

###4.3 the accuracy is quite low, so we apply PCA with whitening to improve it, starting with the first 100 components
pca = PCA(n_components=100, whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
knn_pca = KNeighborsClassifier(n_neighbors=1)
knn_pca.fit(X_train_pca, y_train)
print(knn_pca.score(X_test_pca, y_test))
###0.315891472868
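
###The whiten=True option rescales every principal component to unit variance, so that a
###distance-based model such as kNN weights all components equally. The sketch below shows that
###effect on the breast cancer data, used here only as a stand-in (an assumption) to avoid
###downloading the face images again; the names plain and white are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

plain = PCA(n_components=5).fit_transform(X)
white = PCA(n_components=5, whiten=True).fit_transform(X)

# without whitening, the variance shrinks from component to component
print(plain.var(axis=0, ddof=1).round(3))

# with whitening, every component has (sample) variance 1
print(white.var(axis=0, ddof=1).round(3))
```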