Another very useful clustering algorithm is DBSCAN (which stands for "Density-Based Spatial Clustering of Applications with Noise"). The main benefits of DBSCAN are that

a) it does not require the user to set the number of clusters a priori,

b) it can capture clusters of complex shapes, and

c) it can identify points that are not part of any cluster.

We use data generated with the sklearn library, and the IDE is Sublime Text 3. Most of the code comes from the book Introduction to Machine Learning with Python: https://www.goodreads.com/book/show/32439431-introduction-to-machine-learning-with-python?from_search=true

DBSCAN works by picking an arbitrary point to start with. It is somewhat slower than agglomerative clustering and k-means, but it still scales to relatively large datasets. There are two main parameters in DBSCAN, min_samples and eps. If there are at least min_samples many data points within a distance of eps to a given data point, that point is classified as a core sample. Core samples that are closer to each other than the distance eps are put into the same cluster by DBSCAN. Points that lie within eps of a core sample but have fewer than min_samples neighbors themselves are border points, and all remaining points are labeled noise.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, y = make_blobs(random_state=0, n_samples=12)
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X)
print(clusters)
#[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
###All data points were assigned the label -1, which stands for noise:
###with the default eps=0.5, this small dataset is too sparse.
###Try changing eps and min_samples to see the effect.
fig, axes = plt.subplots(3, 4, figsize=(12, 6),
                         subplot_kw={'xticks': (), 'yticks': ()})
###remove the axis ticks with subplot_kw
colors = np.array(['r', 'b', 'g', 'y'])
###noise points (label -1) pick up the last color, yellow
for i, min_samples in enumerate([2, 3, 5]):
    for j, eps in enumerate([1, 1.5, 2, 3]):
        dbscan = DBSCAN(min_samples=min_samples, eps=eps)
        clusters = dbscan.fit_predict(X)
        print('min_samples: %d eps: %.1f clusters: %s'
              % (min_samples, eps, clusters))
        sizes = 30 * np.ones(X.shape[0])
        sizes[dbscan.core_sample_indices_] *= 8
        ###enlarge the core points by 8 times
        axes[i, j].scatter(X[:, 0], X[:, 1], c=colors[clusters], s=sizes)
        axes[i, j].set_title('min_samples: %d eps: %.1f'
                             % (min_samples, eps))
fig.tight_layout()
plt.show()
###min_samples: 2 eps: 1.0 clusters: [-1 0 0 -1 0 -1 1 1 0 1 -1 -1]
###min_samples: 2 eps: 1.5 clusters: [0 1 1 1 1 0 2 2 1 2 2 0]
###min_samples: 2 eps: 2.0 clusters: [0 1 1 1 1 0 0 0 1 0 0 0]
###min_samples: 2 eps: 3.0 clusters: [0 0 0 0 0 0 0 0 0 0 0 0]
###min_samples: 3 eps: 1.0 clusters: [-1 0 0 -1 0 -1 1 1 0 1 -1 -1]
###min_samples: 3 eps: 1.5 clusters: [0 1 1 1 1 0 2 2 1 2 2 0]
###min_samples: 3 eps: 2.0 clusters: [0 1 1 1 1 0 0 0 1 0 0 0]
###min_samples: 3 eps: 3.0 clusters: [0 0 0 0 0 0 0 0 0 0 0 0]
###min_samples: 5 eps: 1.0 clusters: [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
###min_samples: 5 eps: 1.5 clusters: [-1 0 0 0 0 -1 -1 -1 0 -1 -1 -1]
###min_samples: 5 eps: 2.0 clusters: [-1 0 0 0 0 -1 -1 -1 0 -1 -1 -1]
###min_samples: 5 eps: 3.0 clusters: [0 0 0 0 0 0 0 0 0 0 0 0]
```
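The core/border/noise distinction can also be read off programmatically, using the fitted estimator's `labels_` and `core_sample_indices_` attributes. As a small sketch (reusing the same 12-point blob data, with min_samples=3 and eps=1.5 picked from the grid above):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, y = make_blobs(random_state=0, n_samples=12)
db = DBSCAN(min_samples=3, eps=1.5).fit(X)

###core points are listed in core_sample_indices_
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
###noise points carry the cluster label -1
noise_mask = db.labels_ == -1
###border points belong to a cluster but are not core samples
border_mask = ~core_mask & ~noise_mask

print("core points:  ", core_mask.sum())
print("border points:", border_mask.sum())
print("noise points: ", noise_mask.sum())
```

With these settings every point ends up in a cluster (the printed labels above contain no -1), so the noise count is zero.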

We can also use DBSCAN to cluster the two-moons data we used in the last article:

https://charleshsliao.wordpress.com/2017/05/30/quick-clustering-in-python/

In the resulting grid, points that belong to clusters are colored, while the noise points are shown in yellow. Core samples are shown as large points, while border points are displayed as smaller ones.

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
###rescale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_s = scaler.transform(X)
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X_s)
plt.scatter(X_s[:, 0], X_s[:, 1], c=clusters, cmap="Paired", s=40)
plt.show()
```
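Since DBSCAN does not take a number of clusters, it is worth checking how many it actually found, and how well they match the true moon membership. A quick sanity check (the use of adjusted_rand_score here is my addition, not part of the original code) might look like:

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_s = StandardScaler().fit_transform(X)
clusters = DBSCAN().fit_predict(X_s)

###count clusters found, excluding the noise label -1
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
print("clusters found:", n_clusters)
###compare the found clusters against the true moon labels;
###1.0 means a perfect match up to relabeling
print("ARI: %.2f" % adjusted_rand_score(y, clusters))
```

On this scaled dataset the default parameters recover the two half-moon shapes, which is exactly the kind of complex cluster shape that k-means cannot capture.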