K-MEANS AND HIERARCHICAL CLUSTERING IN PYTHON

arikuncoro June 5, 2020

Tonight, 5 June 2020, I was assigned by IYKRA to deliver “Clustering” online class training at Data MBA Batch #3 program. I would like to show you the summary of the class. Here the agenda is.

Pre-Quiz (19:00 – 19:05)

Theory of unsupervised learning: Clustering (19:05 – 19:20)
K-Means Clustering (19:20 – 19:40)
Hierarchical Clustering (19:40 – 20:00)
Measurement Parameters for Clustering (20:00 – 20:15)
Hands on and Exercise of K-Means Clustering and Hierarchical Clustering with Python (20:15 – 20:45)

Q & A (20:45 – 20:55)
Post-Quiz (20:55 – 21:00)

Note: The hands on will be delivered in Python (Point 5).

1. Theory of unsupervised learning: Clustering. (19:05 – 19:20)

What is clustering?

Clustering is the most common form of unsupervised learning, a type of machine learning algorithm used to draw inferences from unlabeled data.

Clustering is the task of grouping together a set of objects in a way that objects in the same cluster are more similar to each other than to objects in other clusters. (https://www.datacamp.com/community/tutorials/k-means-clustering-python

Clustering is mainly used for exploratory data mining. It has manifold usage in many fields such as machine learning, pattern recognition, image analysis, information retrieval, bio-informatics, data compression, and computer graphics.

Goal of clustering

Clustering algorithms group a set of data points into subsets or clusters. The algorithms’ goal is to create clusters that are coherent internally, but clearly different from each other externally. In other words, entities within a cluster should be as similar as possible and entities in one cluster should be as dissimilar as possible from entities in another.

Measurement of Similarity

Similarity is a metric that reflects the strength of relationship between two data objects. The similarity between the clusters is often calculated from the dissimilarity measures like the euclidean distance between two clusters. So the larger the distance between two clusters, the better it is.

A. Euclidean (if you have continuous numerical values)
B. Jaccard (if you have categorical data)
C. Manhattan
D. Canberra
E. Minkowski

Pre-processing Operation before Clustering

A. Scaling : Min-Max Normalization, Z-Score, Log Transform
Check detail methods: https://developers.google.com/machine-learning/clustering/prepare-data
B. Imputation : It’s very crucial to deal with missing/null/inf values in your dataset beforehand. There are many ways to deal with such values, one is to either remove them or impute them with mean, median, mode or use some advanced regression techniques.
Check the detail methods: https://machinelearningmastery.com/handle-missing-data-python/

2. K-Means Clustering (19:20 – 19:40)

What is K-Means Clustering?
K-Means falls under the category of centroid-based clustering. A centroid is a data point (imaginary or real) at the center of a cluster. In centroid-based clustering, clusters are represented by a central vector or a centroid. This centroid might not necessarily be a member of the dataset. Centroid-based clustering is an iterative algorithm in which the notion of similarity is derived by how close a data point is to the centroid of the cluster.

https://www.datacamp.com/community/tutorials/k-means-clustering-python

https://developers.google.com/machine-learning/clustering/algorithm/run-algorithm

Advantages of K-means

A. Relatively simple to implement.
B. Scales to large data sets.
C. Guarantees convergence.
D. Can warm-start the positions of centroids.
E. Easily adapts to new examples.
F. Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

Disadvantages of K-Means
A. Choosing manually.
B. Being dependent on initial values.
C. Clustering data of varying sizes and density.
D. Clustering outliers.
E. Scaling with number of dimensions

Source: https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages

3. Hierarchical Clustering (19:40 – 20:00)

Two ways of Hierarchical clustering data

Based on the algorithmic structure and operation, there are two ways of clustering data:

A. Divisive (Top Down): A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met.

B. Agglomerative (Buttom Up) : An agglomerative approach begins with each observation in a distinct (singleton) cluster, and successively merges clusters together until a stopping criterion is satisfied.

We are going to use agglomerative way.

https://www.datacamp.com/community/tutorials/hierarchical-clustering-R

What is the key operation of Hierarchical Agglomerative clustering?
The key operation in hierarchical agglomerative clustering is to repeatedly combine the two nearest clusters into a larger cluster. There are three key questions that need to be answered first:

How do you represent a cluster of more than one point?
How do you determine the “nearness” of clusters?
When do you stop combining clusters?

How it works?

A. It starts by calculating the distance between every pair of observation points and store it in a distance matrix.
B. It then puts every point in its own cluster.
C. Then it starts merging the closest pairs of points based on the distances from the distance matrix and as a result the amount of clusters goes down by 1.
D. Then it recomputes the distance between the new cluster and the old ones and stores them in a new distance matrix.
E. Lastly it repeats steps 2 and 3 until all the clusters are merged into one single cluster.

There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. Some of the common linkage methods are:

A. Complete-linkage: calculates the maximum distance between clusters before merging.
B. Single-linkage: calculates the minimum distance between the clusters before merging. This linkage may be used to detect high values in your dataset which may be outliers as they will be merged at the end.
C. Average-linkage: calculates the average distance between clusters before merging.
D. Centroid-linkage: finds centroid of cluster 1 and centroid of cluster 2, and then calculates the distance between the two before merging.

https://www.datacamp.com/community/tutorials/hierarchical-clustering-R

Dendrograms
In hierarchical clustering, you categorize the objects into a hierarchy similar to a tree-like diagram which is called a dendrogram. The distance of split or merge (called height) is shown on the y-axis of the dendrogram.

4. Measurement Parameters for Clustering (20:00 – 20:15)

Dunn’s Index
- introduced by J. C. Dunn in 1974
- to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart, as compared to the within cluster variance.
- Higher the Dunn index value, better is the clustering.
- Drawbacks: As the number of clusters and dimensionality of the data increase, the computational cost also increases.

David Bouldin Index
- Introduced by David L. Davies and Donald W. Bouldin in 1979
- Is made using quantities and features inherent to the dataset.
- Lower the DB index value, better is the clustering.
- Drawback: A good value reported by this method does not imply the best information retrieval
  https://www.geeksforgeeks.org/dunn-index-and-db-index-cluster-validity-indices-set-1/

Similarity Matrix
A similarity matrix, also known as a distance matrix, will allow you to understand how similar or far apart each pair of observations.
Hands on and Exercise of K-Means Clustering and Hierarchical

5. Hands on and Exercise of K-Means Clustering and Hierarchical

Please check this link:
https://drive.google.com/drive/folders/1BLsI98KyO6agatP_vw8fdQaSN6MOtKZV?usp=sharing

agglomerative clustering clustering data science hierarchical clustering k-means machine learning python

Catatan Ari Kuncoro