Clustering Algorithms Evaluation in Python

Sometimes we run clustering and then compare the resulting clusters against the true labels of the dataset; this is one way to evaluate clustering results.
There are also other evaluation methods that work with or without ground truth for the data.

We use data from the sklearn library, and the editor is Sublime Text 3. Most of the code comes from the book Introduction to Machine Learning with Python: https://www.goodreads.com/book/show/32439431-introduction-to-machine-learning-with-python?from_search=true

### Assess the outcome of a clustering algorithm with the most important metrics:
### adjusted rand index (ARI) and normalized mutual information (NMI),
### which both provide a quantitative score: 1 for a perfect match with the true labels, around 0 for a random assignment.
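As a quick aside (not from the book), here is a minimal toy sketch of how both metrics are called; both functions live in sklearn.metrics.cluster and take the true labels and the found cluster assignment.

from sklearn.metrics.cluster import adjusted_rand_score, normalized_mutual_info_score

# toy ground truth and a clustering that puts one point in the wrong cluster
true_labels = [0, 0, 0, 1, 1, 1]
found_clusters = [0, 0, 1, 1, 1, 1]
print("ARI: %.2f" % adjusted_rand_score(true_labels, found_clusters))
print("NMI: %.2f" % normalized_mutual_info_score(true_labels, found_clusters))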

from sklearn.datasets import make_moons
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt 

### 1. We can compare clustering algorithms on sample data using ARI
from sklearn.metrics.cluster import adjusted_rand_score
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
# Rescale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})
# create a random cluster assignment for reference:
random_state = np.random.RandomState(seed=0)
random_clusters = random_state.randint(low=0, high=2, size=len(X))

axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap="Paired", s=60)
axes[0].set_title("Random assignment - ARI: %.2f" % adjusted_rand_score(y, random_clusters))

algs = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]
for ax, alg in zip(axes[1:], algs):
	clusters = alg.fit_predict(X_scaled)
	ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap="Paired", s=60)
	ax.set_title("%s - ARI: %.2f" % (alg.__class__.__name__, adjusted_rand_score(y, clusters)))
plt.show()
### The ARI scores match intuition, as shown below: DBSCAN recovers the two moons and gets a perfect ARI of 1.00

[figure_1.png: random assignment, KMeans, AgglomerativeClustering, and DBSCAN clusterings with their ARI scores]
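A side note on why ARI is used here instead of plain accuracy: accuracy would penalize a clustering merely for naming its clusters differently than the labels, while ARI only cares about which points end up grouped together. A minimal sketch (not from the post above):

from sklearn.metrics import accuracy_score
from sklearn.metrics.cluster import adjusted_rand_score

# the two assignments group the points identically; only the cluster ids differ
clusters1 = [0, 0, 1, 1, 0]
clusters2 = [1, 1, 0, 0, 1]
print("Accuracy: %.2f" % accuracy_score(clusters1, clusters2))  # 0.00
print("ARI: %.2f" % adjusted_rand_score(clusters1, clusters2))  # 1.00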

### 2. In practice, ARI often cannot be used to evaluate clustering because there is no ground truth (no labels).
### There are metrics for clustering that don't require ground truth, like the silhouette coefficient.
### However, these often don't work well in practice.
from sklearn.metrics.cluster import silhouette_score
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})
# create a random cluster assignment for reference:
random_state = np.random.RandomState(seed=0)
random_clusters = random_state.randint(low=0, high=2, size=len(X))
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap="Paired", s=60)
axes[0].set_title("Random assignment - ARI: %.2f" % adjusted_rand_score(y, random_clusters))
algs = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]
for ax, alg in zip(axes[1:], algs):
	clusters = alg.fit_predict(X_scaled)
	ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap="Paired", s=60)
	ax.set_title("%s : %.2f" % (alg.__class__.__name__, silhouette_score(X_scaled, clusters)))
plt.show()

### k-Means gets the highest silhouette score, even though we might prefer the result produced by DBSCAN

[figure_2.png: random assignment, KMeans, AgglomerativeClustering, and DBSCAN clusterings with their silhouette scores]
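To make the disagreement between the two metrics explicit, here is a small sketch (same make_moons setup as above, not from the book) that prints ARI and silhouette side by side for each algorithm, so the numbers can be compared without reading them off the plots:

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics.cluster import adjusted_rand_score, silhouette_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

for alg in [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]:
	clusters = alg.fit_predict(X_scaled)
	print("%s  ARI: %.2f  silhouette: %.2f" % (
		alg.__class__.__name__,
		adjusted_rand_score(y, clusters),
		silhouette_score(X_scaled, clusters)))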
