Lecture 6. Clustering Methods
Date: 2023-05-25
1. Overview
Clustering is an unsupervised learning method that groups similar data points together. Unlike supervised learning, clustering does not require pre-labeled groups. Instead, it uses a similarity (or distance) measure to determine how close data points are to each other. Clustering is useful for customer segmentation, anomaly detection, and topic extraction.
PCA vs. Clustering
Method | Goal | Approach | Usage |
---|---|---|---|
PCA (Principal Component Analysis) | Dimensionality reduction: transforms high-dimensional data into a lower-dimensional form while retaining variance. | Uses an orthogonal transformation to convert correlated features into linearly uncorrelated ones. | Pre-processing, visualization, noise reduction. |
Clustering | Grouping similar data points together without pre-labeled groups (unsupervised learning). | Various methods; a similarity (or distance) measure is key. | Customer segmentation, anomaly detection, topic extraction. |
Clustering Methods
Method | Description | Pros | Cons | Application |
---|---|---|---|---|
K-Means | Assigns data to the nearest cluster center; recalculates center as mean of cluster points. | Simple, scalable. | Assumes spherical clusters; sensitive to initial center placement. | Market segmentation, Document clustering. |
Hierarchical Clustering | Builds a tree of clusters. Can be bottom-up (agglomerative) or top-down (divisive). | No need to specify the number of clusters; provides dendrogram insights. | Not scalable for large datasets. | Phylogenetic trees, Sociological studies. |
DBSCAN | Clusters based on density. Labels sparse regions as noise. | Finds arbitrarily shaped clusters; no need to specify the number of clusters. | Struggles with clusters of varying density. | Spatial data analysis, Noise removal. |
Gaussian Mixture Models (GMM) | Assumes data from several Gaussian distributions. Uses Expectation-Maximization. | Can model elliptical clusters. | Computationally intensive. | Image segmentation, Anomaly detection. |
Agglomerative Clustering | Each data point starts as its own cluster; clusters are merged moving up the hierarchy. | Suitable for smaller datasets; produces a hierarchy. | Not scalable for large datasets. | Biological taxonomy, Hierarchical document clustering. |
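To give a rough feel for how these methods are invoked in practice, here is a minimal sketch assuming scikit-learn is available. The synthetic blob data and all parameter values (`n_clusters`, `eps`, `min_samples`, and so on) are illustrative choices, not part of the lecture material.

```python
# Minimal sketch: running several clustering methods on the same toy data.
# Assumes scikit-learn and NumPy are installed; parameters are illustrative only.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)  # -1 marks noise points
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)

print(np.unique(dbscan_labels))  # clusters found by DBSCAN (plus -1 for noise)
```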
Distance/Similarity Measures
Measure | Description | Formula |
---|---|---|
Euclidean Distance | "Ordinary" straight-line distance between two points in Euclidean space. | $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ |
Manhattan Distance | Distance between two points traveled along axes at right angles (L1 norm). | $d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert$ |
Cosine Similarity | Measures the cosine of the angle between two non-zero vectors. Useful in high-dimensional spaces. | $\cos(\theta) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$ |
Jaccard Similarity | Measures the similarity between finite sample sets. Used for comparing the similarity and diversity of sample sets. | $J(A, B) = \dfrac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$ |
Mahalanobis Distance | Measures the distance between a point and a distribution, accounting for correlations. | $d(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top} S^{-1} (\mathbf{x} - \boldsymbol{\mu})}$ |
Where:
- $\mathbf{x}$ and $\mathbf{y}$ are points in n-dimensional space.
- $A$ and $B$ represent two vectors or sets.
- $\mathbf{x}$ is a point, $\boldsymbol{\mu}$ is the mean of the distribution, and $S$ is the covariance matrix.
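To make the formulas concrete, here is a minimal sketch using NumPy and SciPy. The example vectors, sets, and sample data are arbitrary placeholders.

```python
# Minimal sketch of the distance/similarity measures above.
# Assumes NumPy and SciPy are installed; the example vectors are arbitrary.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine, mahalanobis

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print("Euclidean:", euclidean(x, y))           # straight-line (L2) distance
print("Manhattan:", cityblock(x, y))           # L1 distance
print("Cosine similarity:", 1 - cosine(x, y))  # SciPy returns cosine *distance*

# Jaccard similarity on two finite sets
A, B = {1, 2, 3}, {2, 3, 4}
print("Jaccard:", len(A & B) / len(A | B))

# Mahalanobis distance of a point from a small sample distribution
data = np.random.default_rng(0).normal(size=(100, 3))
mu = data.mean(axis=0)
S_inv = np.linalg.inv(np.cov(data, rowvar=False))
print("Mahalanobis:", mahalanobis(x, mu, S_inv))
```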
Issues
- Choice of Algorithm: Not all clustering methods are suitable for all types of data. The nature and shape of the data can influence the clustering results.
- Determining K: In algorithms like K-Means, choosing the right number of clusters (k) is crucial but not straightforward.
- Scaling: Features with larger scales can dominate the clustering process. Scaling the data is often necessary.
- Outliers: Some clustering methods are sensitive to outliers, which can skew results.
- Initialization Sensitivity: Algorithms like K-Means are sensitive to the initial placement of centroids.
- Local Optima: Some algorithms might converge to a local optimum rather than a global optimum, leading to suboptimal clustering.
2. K-Means Clustering
Algorithm
- Initialization: Select `k` initial centroids, where `k` is the number of clusters you want to classify your data into. These centroids can be random data points from the dataset.
- Assignment: Assign each data point to the nearest centroid. This forms `k` clusters.
- Recalculation: For each of the `k` clusters, compute the new centroid as the mean of all the data points in that cluster.
- Reassignment: Reassign each data point to the nearest centroid.
- Repeat: Continue the processes of recalculation and reassignment until the centroids no longer change significantly or a set number of iterations is reached.
Math
The objective of K-Means is to minimize the sum of squared distances between data points and their assigned cluster centroids.
Objective function can be represented as:

$$J = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2$$

Where:
- $k$ = number of clusters.
- $C_i$ = i-th cluster.
- $\boldsymbol{\mu}_i$ = centroid of i-th cluster.
- $\mathbf{x}$ = data point.
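The steps above translate almost directly into code. Below is a minimal from-scratch NumPy sketch ending with the objective $J$; the toy blobs, the choice `k = 3`, the plain random initialization, and the iteration cap are all assumptions made for illustration.

```python
# Minimal from-scratch K-Means sketch (Lloyd's algorithm); NumPy only.
# Toy data and k=3 are arbitrary; initialization is plain random sampling.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
k = 3

centroids = X[rng.choice(len(X), size=k, replace=False)]   # Initialization
for _ in range(100):                                       # Repeat until convergence
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                          # Assignment / Reassignment
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    if np.allclose(new_centroids, centroids):              # centroids stopped moving
        break
    centroids = new_centroids                              # Recalculation

J = sum(((X[labels == i] - centroids[i]) ** 2).sum() for i in range(k))
print("objective J (within-cluster sum of squares):", J)
```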
Pros and Cons
Pros:
- Simple and intuitive algorithm.
- Relatively efficient, especially for large datasets.
Cons:
- Requires the number of clusters (`k`) to be specified beforehand.
- Initial placement of centroids can affect the final output (can be mitigated using methods like K-Means++ for initialization).
- Assumes clusters to be spherical and equally sized, which might not always be the case.
- Sensitive to outliers.
Example
Let's consider a set of data points on a 2D plane. We want to group these data points into 3 clusters.
- We start by randomly selecting 3 data points as our initial centroids.
- Each data point is then assigned to the nearest centroid, forming 3 groups.
- For each group, we calculate a new centroid.
- We then reassign each data point to the nearest (newly computed) centroid.
- This process is repeated until our centroids no longer move significantly.
At the end of this process, our data points are grouped into 3 distinct clusters based on their proximity to each centroid.
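The same walkthrough can be reproduced with scikit-learn's `KMeans`. This is a sketch: the `make_blobs` data stands in for the 2D points described above, and the parameter values are arbitrary.

```python
# Grouping synthetic 2D points into 3 clusters with scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=1)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=1)
labels = km.fit_predict(X)

print("final centroids:\n", km.cluster_centers_)
print("inertia (objective J):", km.inertia_)
```

Note that `init="k-means++"` corresponds to the initialization improvement mentioned in the cons above.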
3. Hierarchical Clustering
Algorithm
Hierarchical clustering works by either dividing a dataset into smaller groups or by merging them until a desired structure is achieved. There are two main strategies:
- Agglomerative (bottom-up):
  - Initialization: Each data point starts as its own cluster.
  - Merging: In each of the subsequent stages, the two clusters that are closest to each other are merged into a single cluster.
  - Termination: This continues until there's a single cluster containing all data points.
- Divisive (top-down):
  - Initialization: The whole dataset is considered as one big cluster.
  - Division: In each stage, the biggest or the farthest cluster is divided into two.
  - Termination: This continues until each data point is its own cluster.
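Here is a minimal sketch of the agglomerative strategy only (scikit-learn does not ship a divisive implementation). The toy data, linkage choice, and distance threshold are illustrative assumptions.

```python
# Agglomerative (bottom-up) clustering sketch; synthetic data, illustrative parameters.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=60, centers=3, random_state=7)

# Either fix the number of clusters...
agg = AgglomerativeClustering(n_clusters=3, linkage="average")
print(agg.fit_predict(X)[:10])

# ...or cut the hierarchy at a distance threshold instead of specifying a count.
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0, linkage="average")
print("clusters found:", agg_t.fit(X).n_clusters_)
```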
Math
The essence of hierarchical clustering lies in determining the "closeness" or "distance" between clusters. Several linkage methods determine this:
- Single Linkage: Minimum pairwise distance, $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$.
- Complete Linkage: Maximum pairwise distance, $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$.
- Average Linkage: Average pairwise distance, $d(A, B) = \frac{1}{\lvert A \rvert \lvert B \rvert} \sum_{a \in A} \sum_{b \in B} d(a, b)$.
- Centroid Linkage: Distance between the centroids of the clusters, $d(A, B) = d(\boldsymbol{\mu}_A, \boldsymbol{\mu}_B)$.

Where $d(a, b)$ is typically the Euclidean distance.
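These linkage choices map directly onto SciPy's hierarchy module. A sketch follows; the small synthetic dataset and the cut at two clusters are assumptions made for the example.

```python
# Comparing linkage methods with SciPy; data is synthetic and small on purpose.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                     # merge history, one row per merge
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, "-> cluster sizes:", np.bincount(labels)[1:])

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (requires matplotlib).
```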
Pros and Cons
Pros:
- Provides a dendrogram, a tree-like diagram that gives a comprehensive view of the clustering process.
- Does not require the number of clusters to be specified beforehand.
Cons:
- Not scalable for large datasets due to its quadratic time complexity.
- Once a merge or split decision is made, it cannot be undone.
Example
Imagine you have a small dataset of 5 documents: A, B, C, D, and E. You want to cluster them based on their similarity in content.
Using agglomerative hierarchical clustering:
- Initially, each document is its own cluster: {A}, {B}, {C}, {D}, {E}.
- After analyzing similarities, you find A and B are quite similar. So, you merge them: {A, B}, {C}, {D}, {E}.
- Next, C and D seem closely related: {A, B}, {C, D}, {E}.
- Finally, {C, D} and {E} are merged due to some shared topics: {A, B}, {C, D, E}.
You're left with two clusters of documents. This can be represented in a dendrogram, showing at which steps clusters were merged and offering insights into the hierarchical structure of the data.
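A sketch of this document example is below, assuming scikit-learn. The five short texts are invented stand-ins for documents A through E; with TF-IDF features the grouping should roughly mirror the walkthrough, though the exact result depends on the texts chosen.

```python
# Agglomerative clustering of 5 tiny "documents"; the texts are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

docs = {
    "A": "cats and dogs as household pets",
    "B": "dogs, cats and other pets at home",
    "C": "stock markets and interest rates",
    "D": "interest rates, bonds and the stock market",
    "E": "market news and economic policy",
}

X = TfidfVectorizer().fit_transform(docs.values()).toarray()
labels = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)
print(dict(zip(docs, labels)))  # cluster label per document (numbering is arbitrary)
```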
4. Gaussian Mixture Models (GMM)
Algorithm
A Gaussian Mixture Model (GMM) represents a composite distribution in which each point is drawn from one of `k` Gaussian components. In essence, GMM is a probabilistic model stating that all generated data points are derived from a mixture of several Gaussian distributions with unknown parameters.
The steps of the GMM algorithm are:
1. Initialize the Gaussian distribution parameters (means, variances, and mixture coefficients).
2. Expectation (E) Step: Estimate the probabilities of data points belonging to each Gaussian.
3. Maximization (M) Step: Adjust the model parameters to maximize the likelihood of the data given those assignments.
4. Repeat steps 2 and 3 until the model converges.
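In practice the EM loop is usually delegated to a library. Here is a sketch with scikit-learn's `GaussianMixture`, where EM runs inside `fit()`; the synthetic 2D data and parameter settings are illustrative assumptions.

```python
# Fitting a 2-component GMM with scikit-learn; the EM iterations happen in .fit().
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
               rng.normal([4, 4], 1.5, (200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", n_init=5, random_state=0)
gmm.fit(X)

print("mixture weights:", gmm.weights_)                  # the mixture coefficients
print("component means:\n", gmm.means_)                  # the component means
print("soft assignments:\n", gmm.predict_proba(X[:3]))   # E-step responsibilities
```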
Math
The GMM is defined by:

$$p(\mathbf{x}) = \sum_{i=1}^{k} \pi_i \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$$

Where:
- $\pi_i$ is the mixture coefficient for the $i$-th component.
- $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ is the Gaussian distribution with mean $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$.

The goal is to find the parameters ($\pi_i$, $\boldsymbol{\mu}_i$, $\boldsymbol{\Sigma}_i$) that maximize the likelihood of the data. This is achieved using the Expectation-Maximization (EM) algorithm.
Pros and Cons
Pros:
- Flexibility in cluster covariance: GMMs can model elliptical clusters, which might be an advantage over models like K-Means that assume spherical clusters.
- Soft clustering: gives probabilistic cluster assignments, which can be useful if you're uncertain about the true categorization of an instance.

Cons:
- Computationally intensive, especially with a large number of components or large datasets.
- Sensitive to the initialization of parameters.
- Might converge to a local optimum; it's often a good idea to run the algorithm multiple times with different initializations.
Example
Imagine you're analyzing height data from a group of people, and you notice that the data appears to have two peaks in its histogram. This could suggest that there are two different groups or clusters in your data (e.g., male and female heights).
Using GMM, you model the data as being generated from two Gaussian distributions. After running the algorithm, you might find: - The first Gaussian has a mean at 5'5" (indicative of female heights). - The second Gaussian has a mean at 5'10" (indicative of male heights).
By doing this, you've probabilistically assigned each data point to one of the two Gaussian distributions, allowing for a more nuanced understanding of the height data.
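A sketch of this height example follows, using simulated data in inches (the two generated groups are centered around 65 in and 70 in to mimic the 5'5" and 5'10" peaks; the spreads and sample sizes are made up).

```python
# Two-component GMM on simulated 1D height data (inches); values are made up.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
heights = np.concatenate([rng.normal(65, 2.5, 500),   # "female-like" group, ~5'5"
                          rng.normal(70, 3.0, 500)])  # "male-like" group, ~5'10"

gmm = GaussianMixture(n_components=2, random_state=42).fit(heights.reshape(-1, 1))
print("estimated means (inches):", gmm.means_.ravel())
print("P(component | 68 inches):", gmm.predict_proba([[68.0]]).round(3))
```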
5. Q&A
- Q: What type of learning algorithm is K-Means Clustering? A: K-Means Clustering is an unsupervised learning algorithm used for partitioning data into distinct, non-overlapping groups.
- Q: In K-Means Clustering, what does "K" represent? A: "K" represents the number of clusters into which the data is to be partitioned.
- Q: What are the two main strategies of Hierarchical Clustering? A: The two main strategies are Agglomerative (bottom-up, where each data point starts as its own cluster and clusters merge) and Divisive (top-down, where the whole dataset starts as one cluster and gets divided).
- Q: How is the "distance" or "closeness" between clusters determined in Hierarchical Clustering? A: The distance is determined using linkage methods such as Single Linkage (minimum pairwise distance), Complete Linkage (maximum pairwise distance), Average Linkage (average pairwise distance), or Centroid Linkage (distance between the centroids of the clusters).
- Q: What probabilistic model does Gaussian Mixture Models (GMM) use to represent data? A: GMM represents data as if it were derived from a mixture of several Gaussian distributions.
- Q: What algorithmic approach is commonly used to estimate the parameters of GMM? A: The Expectation-Maximization (EM) algorithm is commonly used.
- Q: How does the Cosine Similarity metric measure similarity between two vectors? A: Cosine Similarity measures the cosine of the angle between two non-zero vectors. It's especially useful in high-dimensional spaces.
- Q: What is the primary difference between Euclidean and Manhattan distances? A: Euclidean distance is the "ordinary" straight-line distance between two points in Euclidean space, while Manhattan distance, also known as L1 norm, is the distance between two points traveled along axes at right angles.
- Q: In which clustering method would you obtain a dendrogram? A: A dendrogram is obtained in Hierarchical Clustering.
- Q: Why might one choose GMM over K-Means Clustering? A: GMM is chosen over K-Means when there's a belief that the data is generated from a mixture of several Gaussian distributions. It's more flexible than K-Means, as it allows for elliptical clusters and provides a probabilistic cluster assignment.
- Q: Why is it important to scale features before applying K-Means Clustering? A: Scaling ensures that all features contribute equally to the computation of distances. Without scaling, features with larger magnitudes can disproportionately influence cluster assignments.
- Q: How does the choice of distance metric affect Hierarchical Clustering? A: The choice of distance metric can greatly influence the shape and structure of the dendrogram, and consequently, the resulting clusters. Different metrics may yield different hierarchies and cluster shapes.
- Q: What is a key limitation of using K-Means Clustering for datasets with clusters of varying sizes and densities? A: K-Means assumes that clusters are spherical and equally sized, which can lead to poor performance when encountering clusters of different shapes, sizes, and densities.
- Q: Why might one prefer Hierarchical Clustering over K-Means Clustering or GMM? A: Hierarchical Clustering doesn't require pre-specifying the number of clusters and it provides a dendrogram that offers insights into the nested structure of clusters.
- Q: What's the main difference between hard and soft clustering in the context of GMM? A: Hard clustering assigns each data point to exactly one cluster, while soft clustering (as in GMM) provides a probabilistic assignment, indicating the likelihood of a point belonging to each cluster.
- Q: What is the silhouette coefficient, and how is it used in clustering? A: The silhouette coefficient measures how close each point in one cluster is to the points in the neighboring clusters. It's a metric used to evaluate the validity of a clustering solution (see the sketch after this list).
- Q: How does the "elbow method" help determine the optimal number of clusters for K-Means? A: The elbow method involves plotting the explained variation as a function of the number of clusters, and picking the "elbow" of the curve as the number of clusters to use. The "elbow" typically represents an inflection point where adding more clusters doesn't provide much better fit to the data (the sketch after this list computes inertia for a range of k).
- Q: In GMM, what is the significance of the Expectation-Maximization (EM) algorithm? A: EM iteratively optimizes the likelihood of observing the data under the model. The "Expectation" step estimates the probability of data points belonging to clusters, and the "Maximization" step adjusts the model parameters based on these probabilities.
- Q: How does the Mahalanobis distance differ from the Euclidean distance? A: Mahalanobis distance takes into account the correlations in the data. It measures distance relative to the centroid of a distribution, scaled by the covariance matrix. This makes it more suitable for datasets where features are correlated.
- Q: What role does the initial placement of centroids play in K-Means Clustering? A: The initial placement of centroids can significantly impact the final clustering result. Poor initialization might lead the algorithm to converge to a local minimum, yielding sub-optimal clusters.