Quick Clustering in Python

We use toy datasets from the sklearn library, and the editor is Sublime Text 3. Most of the code comes from the book Introduction to Machine Learning with Python: https://www.goodreads.com/book/show/32439431-introduction-to-machine-learning-with-python?from_search=true You can read more about t-SNE here: http://distill.pub/2016/misread-tsne/
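The distill.pub article is about how t-SNE plots can mislead; for reference, here is a minimal sketch of running t-SNE itself. The sklearn digits dataset and the parameters are my own choice for illustration, not from the post:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# embed the 64-dimensional digits data into 2D with t-SNE
digits = load_digits()
digits_tsne = TSNE(random_state=42).fit_transform(digits.data)
plt.scatter(digits_tsne[:, 0], digits_tsne[:, 1], c=digits.target, cmap="Paired", s=10)
plt.show()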

from sklearn.datasets import make_moons
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# two interleaved half-moons with a little noise
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# deliberately use many more clusters than the two "real" groups,
# so each cluster only has to cover a small patch of a moon
kmeans = KMeans(n_clusters=10)
kmeans.fit(X)
y_pred = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=10, cmap="Paired")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="^", c=range(kmeans.n_clusters), s=100, linewidths=2, cmap="Paired")
print(y_pred)
plt.show()

[figure_1.png: k-means clustering of the two-moons data with 10 clusters; triangles mark the cluster centers]
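With the "natural" choice of two clusters, k-means fails on this shape, since it can only find convex, roughly spherical clusters. A quick sketch to see this (my addition, not in the original post):

# k-means with 2 clusters cuts straight across the two moons
kmeans2 = KMeans(n_clusters=2)
kmeans2.fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans2.predict(X), s=10, cmap="Paired")
plt.show()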

###Agglomerative clustering refers to a collection of clustering algorithms that
###all build upon the same principles: the algorithm starts by declaring each
###point its own cluster, and then merges the two most similar clusters until
###some stopping criterion is satisfied. Because of the way the algorithm works,
###agglomerative clustering cannot make predictions for new data points.

from sklearn.cluster import AgglomerativeClustering

# three well-separated Gaussian blobs
X, y = make_blobs(random_state=1)
agg = AgglomerativeClustering(n_clusters=3)
# there is no separate predict step: fit_predict returns the cluster assignment directly
assignment = agg.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=assignment, cmap="Paired", s=60)
plt.show()

[figure_2.png: agglomerative clustering of three blobs with n_clusters=3]
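Since agglomerative clustering has no predict method, a common workaround is to train a separate classifier on the cluster labels and use that to assign new points. This sketch is my own, not from the book; the new data is just fresh blobs for illustration:

from sklearn.neighbors import KNeighborsClassifier

# X and assignment come from the block above
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, assignment)
X_new, _ = make_blobs(random_state=5, n_samples=10)
print(knn.predict(X_new))  # cluster labels for the new points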

###Agglomerative clustering produces what is known as a hierarchical clustering.
###The measure of "most similar cluster" is called the linkage criterion;
###the following three choices are implemented in scikit-learn:
###• “ward”, which is the default choice. Ward picks the two clusters to merge
###such that the variance within all clusters increases the least. This often
###leads to clusters that are relatively equally sized.
###• “average” linkage merges the two clusters that have the smallest average
###distance between all their points.
###• “complete” linkage (also known as maximum linkage) merges the two clusters
###that have the smallest maximum distance between their points.
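To see how the linkage choice changes the result, here is a small comparison sketch (my own, not from the book) that fits the same blobs with each criterion:

import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# fit the same data with each of the three linkage criteria, side by side
X, y = make_blobs(random_state=1)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, linkage in zip(axes, ["ward", "average", "complete"]):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="Paired", s=60)
    ax.set_title(linkage)
plt.show()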

from scipy.cluster.hierarchy import dendrogram, ward

X, y = make_blobs(random_state=0, n_samples=12)
# apply ward clustering to the data array X;
# ward returns the linkage array encoding the hierarchical merges
linkage_array = ward(X)
# plot the dendrogram of the linkage array
dendrogram(linkage_array)

# mark the cuts in the tree that signify two or three clusters
ax = plt.gca()
bounds = ax.get_xbound()
ax.plot(bounds, [7.25, 7.25], '--', c='k')
ax.plot(bounds, [4, 4], '--', c='k')
ax.text(bounds[1], 7.25, ' two clusters', verticalalignment='center', fontdict={'size': 15})
ax.text(bounds[1], 4, ' three clusters', verticalalignment='center', fontdict={'size': 15})
plt.title("dendrogram")
plt.show()

[Screen Shot 2017-05-29 at 10.43.13 PM.png: dendrogram of the ward clustering, with dashed lines marking cuts into two and three clusters]
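Rather than just eyeballing the cuts, scipy can extract the flat cluster labels at a given cut. A short sketch using fcluster (my addition):

from scipy.cluster.hierarchy import fcluster

# cut the same ward tree into two and three flat clusters
two = fcluster(linkage_array, t=2, criterion='maxclust')
three = fcluster(linkage_array, t=3, criterion='maxclust')
print(two)    # labels 1-2 for the 12 points
print(three)  # labels 1-3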
