K-means, Hierarchical, and Feature Selection Methods

We will use the well-known iris data set to do some quick clustering.

(site : http://archive.ics.uci.edu/ml/datasets/Iris)

(data : http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data )

(description : http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names)

</pre>
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
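
# Random 80/20 split into training and testing rows (call set.seed() first for a reproducible split)
order<-sample(150,150*0.8)
training<-iris[order,]
testing<-iris[-order,]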

# K is the number of clusters (groups), and it is 3 for this case.
iriskmeans<-kmeans(training[,-5],centers=3,iter.max = 10, nstart=1)
summary(iriskmeans)

##              Length Class  Mode
## cluster      120    -none- numeric
## centers       12    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

table(training$Species,iriskmeans$cluster) # species (rows) against cluster labels (columns)

##
##               1  2  3
##   setosa     16  0 28
##   versicolor  3 32  0
##   virginica   0 41  0
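
In the table above, setosa is split across two clusters while versicolor and virginica share one. That is what k-means can look like when a single random start (nstart=1) ends in a poor local optimum. Here is a minimal sketch of trying several random starts. The seed, the value nstart=25, and the names `idx`, `train`, `km_single` and `km_multi` are my own choices for illustration, so the numbers will not match the tables in this post:

```r
set.seed(42)                          # assumed seed, not used in the original run
idx   <- sample(150, 150 * 0.8)       # same 80/20 split idea as above
train <- iris[idx, ]

# One random start versus the best of 25 random starts
km_single <- kmeans(train[, -5], centers = 3, iter.max = 10, nstart = 1)
km_multi  <- kmeans(train[, -5], centers = 3, iter.max = 10, nstart = 25)

# Lower total within-cluster sum of squares means tighter clusters
c(single = km_single$tot.withinss, multi = km_multi$tot.withinss)

table(train$Species, km_multi$cluster)
```

kmeans keeps the run with the lowest tot.withinss among the nstart attempts, so a larger nstart usually gives a more stable result.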
<pre>plot(training[c("Sepal.Length","Sepal.Width")], col = iriskmeans$cluster, pch = as.integer(training$Species))
# Color means cluster, shape means original classification
points(iriskmeans$centers[,c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex=2)
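
The testing rows are set aside above but not used yet. One possible way to use them, sketched here under my own assumptions (the seed, nstart=25, and the helper name `nearest_center` are not from the original code), is to assign each testing row to the closest cluster center by Euclidean distance:

```r
set.seed(42)                          # assumed seed for a reproducible split
idx   <- sample(150, 150 * 0.8)
train <- iris[idx, ]
test  <- iris[-idx, ]

km <- kmeans(train[, -5], centers = 3, nstart = 25)

# Hypothetical helper: index of the nearest center for every row of x
nearest_center <- function(x, centers) {
  x <- as.matrix(x)
  # one column of squared distances per center
  d <- apply(centers, 1, function(ctr) rowSums(sweep(x, 2, ctr)^2))
  unname(apply(d, 1, which.min))
}

test_cluster <- nearest_center(test[, -5], km$centers)
table(test$Species, test_cluster)
```

The cluster numbers are arbitrary labels, so the table should be read by looking at which column each species concentrates in.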


Hierarchical clustering with base R functions lets us choose the distance method, which the kmeans function does not allow the user to change:


dist.r1 <- dist(training[,-5], method="euclidean")
dist.r2 <- dist(training[,-5], method="maximum")
dist.r3 <- dist(training[,-5], method="manhattan")
hc.r1<-hclust(dist.r1, method ="centroid")
hc.r2<-hclust(dist.r2, method ="centroid")
hc.r3<-hclust(dist.r3, method ="centroid")
hc.r1
hc.r2
hc.r3
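
Printing an hclust object only shows the call and the methods used. To compare a tree with the species labels, one option is to cut it into three groups with cutree. This is a minimal sketch, assuming a fixed seed and my own variable names (`idx`, `train`, `hc`, `groups`), shown for the Euclidean case only:

```r
set.seed(42)                          # assumed seed, not in the original code
idx   <- sample(150, 150 * 0.8)
train <- iris[idx, ]

d  <- dist(train[, -5], method = "euclidean")
hc <- hclust(d, method = "centroid")

# Cut the tree into 3 groups and cross-tabulate with the known species
groups <- cutree(hc, k = 3)
table(train$Species, groups)
```

Swapping method="euclidean" for "maximum" or "manhattan" in dist gives the same comparison for the other two trees.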

Now comes the interesting part. Only some of the variables in the iris dataset have a significant impact on the final clustering result. How can we pick them out? Feature selection methods are also helpful when clustering, or applying other unsupervised ML methods to, large datasets with many dimensions.

A great article here introduces three methods: http://machinelearningmastery.com/feature-selection-with-the-caret-r-package/

  1. Remove highly correlated attributes (generally, those with an absolute correlation of 0.75 or higher)
  2. Rank features by importance
  3. Automatic feature selection methods

# Feature selection by getting rid of highly correlated variables
# Work on a copy so that training keeps Species as a factor
newt<-training
newt[,5]<-as.numeric(newt[,5]) # encode Species as 1, 2, 3
cor(newt[,1:5])

##              Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## Sepal.Length    1.0000000  -0.1137410    0.8787128   0.8271220  0.7891058
## Sepal.Width    -0.1137410   1.0000000   -0.4288660  -0.3710568 -0.4355827
## Petal.Length    0.8787128  -0.4288660    1.0000000   0.9651961  0.9514171
## Petal.Width     0.8271220  -0.3710568    0.9651961   1.0000000  0.9587977
## Species         0.7891058  -0.4355827    0.9514171   0.9587977  1.0000000
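
Rather than reading the matrix by eye, the 0.75 rule can be applied programmatically. The sketch below uses base R only (it is not the caret function from the linked article) and runs on the full iris data instead of the training sample, so the values differ slightly from the output above. The names `feats`, `cm` and `high_pairs` are mine:

```r
feats <- iris[, 1:4]                  # the four numeric measurements
cm <- abs(cor(feats))                 # absolute pairwise correlations

# Keep each pair once: blank out the upper triangle and the diagonal
cm[upper.tri(cm, diag = TRUE)] <- NA

# Row/column positions of pairs at or above the 0.75 cutoff
high <- which(cm >= 0.75, arr.ind = TRUE)
high_pairs <- data.frame(
  var1    = rownames(cm)[high[, "row"]],
  var2    = colnames(cm)[high[, "col"]],
  abs_cor = cm[high]
)
high_pairs
```

Each row of high_pairs is a candidate for dropping one of the two variables.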

# Remove Petal.Length and Petal.Width (both correlate above 0.75 with Sepal.Length and with each other)
training<-iris[order,]
iriskmeans2<-kmeans(training[,1:2],centers=3,iter.max = 10, nstart=1)
table(training$Species,iriskmeans2$cluster)

##
##               1  2  3
##   setosa     44  0  0
##   versicolor  4  5 26
##   virginica   1 20 20

plot(training[c("Sepal.Length", "Sepal.Width")], col = iriskmeans2$cluster, pch = as.integer(training$Species))
points(iriskmeans2$centers[,c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex=2)</div>
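
To compare the two species-by-cluster tables with a single number, one option is cluster purity: the share of points that belong to the most common species in their cluster. This is a rough sketch, with the counts typed in by hand from the two tables printed in this post. The function name `purity_from_table` and the matrix names are my own:

```r
# Purity = sum of the largest count in each cluster column / total count
purity_from_table <- function(tab) sum(apply(tab, 2, max)) / sum(tab)

species <- c("setosa", "versicolor", "virginica")

# Counts copied from the table for all four features
all_features <- matrix(c(16,  0, 28,
                          3, 32,  0,
                          0, 41,  0),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(species, 1:3))

# Counts copied from the table for Sepal.Length and Sepal.Width only
sepal_only <- matrix(c(44,  0,  0,
                        4,  5, 26,
                        1, 20, 20),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(species, 1:3))

purity_from_table(all_features)   # 85 of 120 points
purity_from_table(sepal_only)     # 90 of 120 points
```

Purity is only one possible summary, and since both runs used a single random start, the comparison says as much about the starting points as about the features.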
