Clustering Application in Face Recognition in Python

We previously used this face dataset for a PCA application here: https://charleshsliao.wordpress.com/2017/05/28/preprocess-pca-application-in-python/
It will also be interesting to see how clustering algorithms assign the images to different clusters, and to visualize the results.

We use the data from the sklearn library (the face dataset needs to be downloaded separately), and the IDE is Sublime Text 3. Most of the code comes from the book: https://www.goodreads.com/book/show/32439431-introduction-to-machine-learning-with-python?from_search=true

###We apply different clustering algorithms to the face dataset
from sklearn.decomposition import PCA
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt 
from sklearn.datasets import fetch_lfw_people
###1. Load the face datasets
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
image_shape = people.images[0].shape
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask]
y_people = people.target[mask]
X_people = X_people / 255
pca = PCA(n_components=100)
X_pca = pca.fit_transform(X_people)
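
###Optional sanity check on the choice of 100 components: how much of the total variance do they retain?
print("variance retained by 100 components: %.2f" % pca.explained_variance_ratio_.sum())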

###2.1 We apply DBSCAN
dbscan = DBSCAN(min_samples=3, eps=15)
labels = dbscan.fit_predict(X_pca)
print(np.unique(labels))
#[-1  0] 
###This means we have only a single cluster here, plus noise points
###Now we count the number of points in each cluster and the noise to see exactly what it looks like
print(np.bincount(labels+1))
#[   2 2061]
###We have only two noise points here, labeled with "-1" in the array. We can plot them
noise = X_people[labels == -1]
fig, axes = plt.subplots(1, 2)
for image, ax in zip(noise, axes.ravel()):
    ax.imshow(image.reshape(image_shape), vmin=0, vmax=1)
plt.show()

[figure_1.png: the two images flagged as noise by DBSCAN]

###This kind of analysis, trying to find “the odd one out”, is called outlier detection.
###We can guess why these two photos are flagged as noise: in one the person wears a hat, and in the other the face is too close to the camera
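
###If we wanted to exclude these outliers before further analysis, a minimal optional step
###would be to keep only the points not labeled as noise (X_pca_clean and X_people_clean
###are new names introduced here for illustration and are not used below):
X_pca_clean = X_pca[labels != -1]
X_people_clean = X_people[labels != -1]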

###2.2 We can vary eps to see how the clustering changes; "-1" labels the noise data points
for eps in [1, 3, 5, 7, 9, 11, 13]:
    print("\neps=%d" % eps)
    dbscan = DBSCAN(eps=eps, min_samples=3)
    labels = dbscan.fit_predict(X_pca)
    print("Clusters present: %s" % np.unique(labels))
    print("Cluster sizes: %s" % np.bincount(labels + 1))
###eps=1
###Clusters present: [-1]
###Cluster sizes: [2063]

###eps=3
###Clusters present: [-1]
###Cluster sizes: [2063]

###eps=5
###Clusters present: [-1]
###Cluster sizes: [2063]

###eps=7
###Clusters present: [-1  0  1  2]
###Cluster sizes: [1733  323    4    3]

###eps=9
###Clusters present: [-1  0  1  2  3]
###Cluster sizes: [ 694 1360    3    3    3]

###eps=11
###Clusters present: [-1  0]
###Cluster sizes: [ 148 1915]

###eps=13
###Clusters present: [-1  0]
###Cluster sizes: [  15 2048]
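
###One rough way to guide the choice of eps (an optional sketch, assuming X_pca from above)
###is to look at the distribution of pairwise distances in the PCA space:
from sklearn.metrics import pairwise_distances
distances = pairwise_distances(X_pca)
print("pairwise distance percentiles (5/50/95): %s" % np.percentile(distances, [5, 50, 95]))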

###2.3 With eps=9 we can look at the faces inside the three small clusters
dbscan = DBSCAN(min_samples=3, eps=9)
labels = dbscan.fit_predict(X_pca)
for cluster in range(1, max(labels) + 1):  # clusters 1-3; cluster 0 is the large one
    mask = labels == cluster
    n_images = np.sum(mask)
    fig, axes = plt.subplots(1, n_images)
    for image, label, ax in zip(X_people[mask], y_people[mask], axes):
        ax.imshow(image.reshape(image_shape), vmin=0, vmax=1)
        ax.set_title(people.target_names[label].split()[-1])
plt.show()

[figure_2.png, figure_3.png, figure_4.png: the faces in the three small DBSCAN clusters]


###We can observe that some of the clusters correspond to people with very distinct faces 

###3.1 Apply K-means
###Agglomerative clustering and k-Means are much more likely to create clusters of relatively even size,
###but we do need to set the number of clusters in advance.
n_clusters = 10
# extract clusters with k-Means
kmeans = KMeans(n_clusters=n_clusters, random_state=0)
labels_km = kmeans.fit_predict(X_pca)
print("cluster sizes k-Means: %s" % np.bincount(labels_km))
###[219 240 154 216 240 228 229 201 158 178]
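
###As an optional check, we can compare the k-Means partition with the actual person
###identities using the adjusted Rand index (values near 0 indicate little agreement):
from sklearn.metrics import adjusted_rand_score
print("ARI, k-Means vs. true identities: %.2f" % adjusted_rand_score(y_people, labels_km))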

###Visualize the cluster centers. As we clustered in the representation produced by PCA,
###we need to rotate the cluster centers back into image space using pca.inverse_transform
fig, axes = plt.subplots(2, 5, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(12, 4))
for center, ax in zip(kmeans.cluster_centers_, axes.ravel()):
    ax.imshow(pca.inverse_transform(center).reshape(image_shape), vmin=0, vmax=1)
plt.show()

[figure_5.png: the ten k-Means cluster centers, shown as images]

###3.2 The k-Means result is promising, but we might want a more detailed view.
###Below, for each cluster, we show the cluster center, the three most typical images
###(closest to the center) and the three most atypical images (farthest from the center).
n_clusters = 10
for cluster in range(n_clusters):
    center = kmeans.cluster_centers_[cluster]
    mask = kmeans.labels_ == cluster
    # squared distances of all points to this cluster center in PCA space
    dists = np.sum((X_pca - center) ** 2, axis=1)
    dists[~mask] = np.inf
    inds = np.argsort(dists)[:3]  # three most typical images
    dists[~mask] = -np.inf
    inds = np.r_[inds, np.argsort(dists)[-3:]]  # plus three most atypical images
    fig, axes = plt.subplots(1, 7, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(10, 8))
    axes[0].imshow(pca.inverse_transform(center).reshape(image_shape), vmin=0, vmax=1)
    for image, label, ax in zip(X_people[inds], y_people[inds], axes[1:]):
        ax.imshow(image.reshape(image_shape), vmin=0, vmax=1)
        ax.set_title("%s" % people.target_names[label].split()[-1], fontdict={'fontsize': 9})
plt.show()

###We show only three of the ten clusters here; the result is quite straightforward

[Screen Shot 2017-05-30 at 5.11.32/5.11.36/5.11.41 PM.png: three k-Means clusters, each with its center, most typical and most atypical images]

###4.1 Agglomerative clustering. We use the same number of clusters as for k-Means
agglomerative = AgglomerativeClustering(n_clusters=10)
labels_agg = agglomerative.fit_predict(X_pca)
print("cluster sizes agglomerative clustering: %s" % np.bincount(labels_agg))
###cluster sizes agglomerative clustering: [315 268 180 279 302 194 191 129 121  84]
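
###We can also check, as an optional comparison, how similar this partition is to the
###k-Means one from section 3.1 (assuming labels_km is still in scope):
from sklearn.metrics import adjusted_rand_score
print("ARI, k-Means vs. agglomerative: %.2f" % adjusted_rand_score(labels_km, labels_agg))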

###4.2 Plot the corresponding dendrogram
from scipy.cluster.hierarchy import dendrogram, ward
linkage_array = ward(X_pca)
plt.figure(figsize=(20, 5))
dendrogram(linkage_array, p=7, truncate_mode='level', no_labels=True)
plt.show()

[figure_d.png: dendrogram of the ward linkage on the PCA-reduced data]
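
###The same linkage array can also be cut into a flat partition with scipy's fcluster;
###this is an optional sketch, and labels_ward is a new name introduced only for illustration
###(fcluster labels start at 1, so bin 0 of the bincount stays empty):
from scipy.cluster.hierarchy import fcluster
labels_ward = fcluster(linkage_array, t=10, criterion='maxclust')
print("cluster sizes from fcluster: %s" % np.bincount(labels_ward))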

###4.3 We plot the first ten images in each cluster (agglomerative clustering has no cluster centers)
n_clusters = 10
for cluster in range(n_clusters):
    mask = labels_agg == cluster
    fig, axes = plt.subplots(1, 10, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(15, 8))
    axes[0].set_ylabel(np.sum(mask))
    for image, label, ax in zip(X_people[mask], y_people[mask], axes):
        ax.imshow(image.reshape(image_shape), vmin=0, vmax=1)
        ax.set_title("%s" % people.target_names[label].split()[-1], fontdict={'fontsize': 9})
plt.show()

[Screen Shot 2017-05-30 at 5.39.02 PM.png: the first ten images of each agglomerative cluster]

###4.4 Refit with 40 clusters and pick a few of them to inspect
n_clusters = 40
agglomerative = AgglomerativeClustering(n_clusters=n_clusters)
labels_agg = agglomerative.fit_predict(X_pca)
for cluster in [15, 7, 17, 20, 25, 29]:
    mask = labels_agg == cluster
    fig, axes = plt.subplots(1, 15, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(15, 8))
    cluster_size = np.sum(mask)
    axes[0].set_ylabel(cluster_size)
    for image, label, ax in zip(X_people[mask], y_people[mask], axes):
        ax.imshow(image.reshape(image_shape), vmin=0, vmax=1)
        ax.set_title("%s" % people.target_names[label].split()[-1], fontdict={'fontsize': 9})
    for i in range(cluster_size, 15):  # hide unused axes if the cluster has fewer than 15 members
        axes[i].set_visible(False)
plt.show()

[figure_ag.png: selected clusters from the 40-cluster agglomerative solution]
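
###When deciding which of the 40 clusters are worth inspecting, one optional heuristic is
###to look at how "pure" each cluster is with respect to the person labels
###(assuming labels_agg from the 40-cluster fit above):
for cluster in range(40):
    members = y_people[labels_agg == cluster]
    counts = np.bincount(members)
    top = np.argmax(counts)
    print("cluster %d: %d images, most common person: %s (%d)"
          % (cluster, len(members), people.target_names[top], counts[top]))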
