We use data from the scikit-learn library, and the editor is Sublime Text 3. Most of the code comes from the book *Introduction to Machine Learning with Python*: https://www.goodreads.com/book/show/32439431-introduction-to-machine-learning-with-python?from_search=true

### Sometimes the features (variables) in a dataset are not independent of one another.
### We can always inspect the data's correlations before preprocessing and simply drop
### some highly correlated features, but that approach does not scale to a large number
### of variables and throws away some of the data's variability.
### An autoencoder is another way to reduce dimensionality.
### Here we cover a basic method called Principal Component Analysis (PCA), a widely
### used algorithm for data transformation and preprocessing.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

### 1. Before we apply PCA, we scale our data so that each feature has unit variance,
### using StandardScaler.
cancer = load_breast_cancer()
cancerdata = pd.DataFrame(cancer.data)

def rstr(df):
    return df.shape, df.apply(lambda x: [x.unique()])

print('\nstructure of data:\n', rstr(cancerdata))
## We can see that the cancer data has 30 variables.

scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)

### 2. Let us keep the first two principal components of the data.
pca = PCA(n_components=2)
## fit the PCA model to the breast cancer data
pca.fit(X_scaled)
## transform the data onto the principal components
X_pca = pca.transform(X_scaled)
print("Original shape: %s" % str(X_scaled.shape))  ### Original shape: (569, 30)
print("Reduced shape: %s" % str(X_pca.shape))      ### Reduced shape: (569, 2)

### 3. Plot the first two components.
plt.figure(figsize=(8, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cancer.target, s=60)
plt.gca().set_aspect("equal")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()
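It is worth checking how much of the data's variance the two components actually retain; PCA's `explained_variance_ratio_` attribute reports this. A minimal sketch on the same breast cancer data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale the data and fit PCA with two components, as above.
cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each component, and their sum.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the summed ratio is low, two components may be too aggressive a reduction and more components should be kept.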

### 4. Another application of PCA is feature extraction, especially in image recognition.
## Images are usually stored as red, green, and blue intensities for each pixel.
## But images are made up of many pixels, and only together are they meaningful;
## objects in images are usually made up of thousands of pixels.

### 4.1 Load the Labeled Faces in the Wild data and plot some of the faces.
from sklearn.datasets import fetch_lfw_people

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
image_shape = people.images[0].shape

fig, axes = plt.subplots(2, 5, figsize=(15, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
plt.suptitle("some_faces")
plt.show()
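The LFW data is skewed: a few people account for far more images than the rest, which is why the next step caps the number of photos per person. Counting images per identity with `np.unique(..., return_counts=True)` reveals the skew; a sketch with a small stand-in array in place of `people.target`, so it runs without downloading the dataset:

```python
import numpy as np

# Stand-in for people.target: identity 2 dominates the sample.
targets = np.array([0, 0, 1, 2, 2, 2, 2, 2])

# Count how many images each identity has.
ids, counts = np.unique(targets, return_counts=True)
for i, c in zip(ids, counts):
    print("person %d: %d images" % (i, c))
```

On the real `people.target` the same two lines show which celebrities are over-represented.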

## We keep at most 50 photos of each person, and we scale the pixel values to [0, 1].
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask] / 255
y_people = people.target[mask]

### 4.2 Build a baseline model with a k-nearest-neighbors classifier.
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_people, y_people, stratify=y_people, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  ### 0.232558139535

### 4.3 The accuracy is quite low, so we apply PCA to improve it,
### starting with the first 100 components.
pca = PCA(n_components=100, whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

knn_pca = KNeighborsClassifier(n_neighbors=1)
knn_pca.fit(X_train_pca, y_train)
print(knn_pca.score(X_test_pca, y_test))  ### 0.315891472868
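One way to see what the reduced representation discards is `inverse_transform`, which maps PCA-space points back to the original feature space; the gap between the reconstruction and the original data is the information lost by keeping only a few components. A minimal sketch on the bundled breast cancer data (so it runs without the LFW download):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize and reduce to 2 components, as in the first example.
X = StandardScaler().fit_transform(load_breast_cancer().data)
pca = PCA(n_components=2).fit(X)

# Project down and map back up; X_back has the original shape again.
X_back = pca.inverse_transform(pca.transform(X))

# Mean squared reconstruction error: larger means more information lost.
err = np.mean((X - X_back) ** 2)
print(err)
```

The same idea applied to the face data (reconstructing `X_people` from 100 components) shows visually how much facial detail the reduced representation preserves.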
