Lecture 8. Dimension Reduction Methods

Date: 2023-03-21

1. Overview

Dimension reduction techniques fundamentally operate in a two-step sequence. Initially, transformed predictors, represented as $Z_1, Z_2, \ldots, Z_M$, are determined as linear combinations of the original $p$ predictors,

$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j, \quad m = 1, \ldots, M$

Subsequently, the model is trained utilizing these $M$ predictors in place of the original ones. The intricacies lie in how we derive the $Z_m$ or, in other words, how we pick the coefficients $\phi_{jm}$. In this lecture, we'll delve into two prominent strategies for this purpose: principal components analysis (PCA) and partial least squares (PLS).

By reducing dimensionality, we aim to capture the essence of the data in fewer variables, often leading to improved model interpretability and performance, especially when dealing with multicollinearity or when the original data contains many irrelevant features. Both PCA and PLS are techniques that help us achieve this by constructing new sets of variables (or predictors) that are linear combinations of the original ones, but they do so with slightly different goals and methodologies.
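
As a rough sketch of the two-step recipe (synthetic data, and an arbitrary coefficient matrix standing in for the $\phi_{jm}$, which PCA or PLS would actually choose):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                  # n = 100 observations, p = 5 predictors
    y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

    # Step 1: construct M = 2 transformed predictors Z = X @ Phi, where column m of
    # Phi holds the coefficients phi_{jm}. Here Phi is arbitrary; PCA/PLS choose it.
    Phi = rng.normal(size=(5, 2))
    Z = X @ Phi

    # Step 2: fit the model using Z in place of X (least squares with an intercept).
    design = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)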


2. Principal Component Analysis (PCA)

Introduction

Principal Component Analysis, often abbreviated as PCA, is one of the most widely used techniques in dimensionality reduction, especially when it comes to quantitative data. Its primary goal isn't necessarily to predict the response, but rather to capture the main patterns and structures in the data.

Mechanics of PCA

  1. Standardization: Given that PCA is sensitive to variances, it's common practice to standardize each predictor to have a mean of zero and standard deviation of one, especially when predictors are on different scales.

  2. Covariance Matrix Computation: Once data is standardized, the covariance (or correlation) matrix of the predictors is computed.

  3. Eigen Decomposition: The covariance matrix is then decomposed into its eigenvectors and eigenvalues. The eigenvectors (principal component directions) determine the new feature space, and the eigenvalues determine the magnitude or explained variance of each principal component.

  4. Projection: The original data is then projected onto these principal component directions to get principal component scores.
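
A minimal numpy sketch of these four steps on synthetic data (an illustration only, not production code):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))                  # toy data: n = 200, p = 4

    # 1. Standardize each predictor to mean 0 and standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the standardized predictors.
    S = np.cov(Z, rowvar=False)

    # 3. Eigendecomposition (eigh suits symmetric matrices); sort components
    #    by decreasing eigenvalue, i.e. by decreasing explained variance.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 4. Project the standardized data onto the principal component directions.
    scores = Z @ eigvecs                           # principal component scores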

Principal Components

The first principal component (PC1) captures the maximum variance in the data. Each subsequent principal component (PC2, PC3, and so on) captures the maximum remaining variance while being orthogonal to the preceding components. This orthogonality ensures that the principal components are uncorrelated with each other.

Interpretation

The principal components themselves aren't always directly interpretable in the context of the original data, but they serve as a set of new axes that best summarize the variance-carrying information of the original data.

Variance Explained

Often, only the first few principal components are needed to capture a substantial portion of the total variance in the data. A scree plot or a cumulative variance explained plot can help in determining the number of principal components to retain.

Benefits and Limitations

  • Benefits:

    • Reduces the dimensionality of data, often without much loss of information.
    • Mitigates issues related to multicollinearity in regression models.
    • Provides a tool for visualization of high-dimensional data.
  • Limitations:

    • Principal components can be challenging to interpret.
    • Assumes linear relationships among variables.
    • It is sensitive to outliers.

Use Cases

While the main intent of PCA isn't prediction, the transformed predictors (principal component scores) can be used in subsequent regression, clustering, or classification tasks. It's also popular in image processing, genomics, and other fields where data sets with a large number of variables are common.

Mathematical Foundation of PCA

1. Standardization

For a dataset of $n$ observations and $p$ variables, we standardize each variable:

$z_{ij} = \dfrac{x_{ij} - \bar{x}_j}{s_j}$

Where:

  • $x_{ij}$ is the original value of the $j$-th variable for the $i$-th observation.
  • $\bar{x}_j$ is the mean of the $j$-th variable.
  • $s_j$ is the standard deviation of the $j$-th variable.

2. Covariance Matrix

For the standardized data, we calculate the covariance matrix $\Sigma$:

$\Sigma = \dfrac{1}{n-1} Z^{T} Z$

Where:

  • $Z$ is the $n \times p$ matrix of standardized data.
  • $Z^{T}$ is the transpose of $Z$.

3. Eigen Decomposition

The covariance matrix $\Sigma$ is symmetric, so it can be decomposed into its eigenvectors and eigenvalues:

$\Sigma = V \Lambda V^{T}$

Where:

  • $V$ is the matrix whose columns are the eigenvectors of $\Sigma$.
  • $\Lambda$ is the diagonal matrix whose diagonal entries are the eigenvalues of $\Sigma$.

4. Principal Components

The principal components are linear combinations of the original predictors:

$T = Z V$

Where:

  • $T$ is the matrix of principal component scores.
  • $t_1$ (the first column of $T$) contains the scores of the first principal component.
  • $t_2$ (the second column of $T$) contains the scores of the second principal component, and so on.

The variance explained by the $k$-th principal component is given by the ratio of its eigenvalue $\lambda_k$ to the sum of all eigenvalues. The proportion of total variance explained by the first $M$ components is given by:

$\dfrac{\sum_{k=1}^{M} \lambda_k}{\sum_{k=1}^{p} \lambda_k}$

Where:

  • $\lambda_k$ is the $k$-th eigenvalue (in decreasing order).

5. Number of Components

Often, a threshold (e.g., 95% of variance) is set to decide the number of principal components to retain. A scree plot or cumulative variance plot, which visualizes the eigenvalues or the cumulative variance explained, can help in this decision.
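
For example, with scikit-learn (a sketch; the 95% threshold and the synthetic data are illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))   # correlated toy data

    pca = PCA().fit(X)                     # PCA centers the data; scale first if units differ
    cumvar = np.cumsum(pca.explained_variance_ratio_)

    # Smallest number of components whose cumulative explained variance reaches 95%.
    n_components = int(np.argmax(cumvar >= 0.95)) + 1

scikit-learn also accepts a fraction directly, e.g. PCA(n_components=0.95), which performs the same selection.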

Notes:

  • PCA relies on orthogonal transformations to convert correlated features (variables) of possibly non-standardized data into a set of linearly uncorrelated features called principal components. This orthogonal transformation is defined in such a way that the first principal component explains the highest variance, and each succeeding component has the highest variance possible under the constraint that it's orthogonal to the preceding components.
  • The directions of maximum variance (principal component directions) are the eigenvectors of the covariance matrix, and the magnitude of these maximum variances are the corresponding eigenvalues.
  • If the data is normalized (mean zero and variance one), then the covariance matrix is simply the inner product of the data with itself divided by the number of observations.

3. Principal Component Regression (PCR)

PCR is a two-step procedure:

  1. Perform PCA on the predictor matrix $X$ to obtain the principal components.
  2. Use these principal components as predictors in a linear regression model with the target variable $y$.

1. PCA on Predictor Matrix

For a predictor matrix $X$ of size $n \times p$ (where $n$ is the number of observations and $p$ is the number of predictors), we:

Standardize the predictors:

$z_{ij} = \dfrac{x_{ij} - \bar{x}_j}{s_j}$

Calculate the covariance matrix:

$\Sigma = \dfrac{1}{n-1} Z^{T} Z$

Compute the eigenvectors and eigenvalues of $\Sigma$:

$\Sigma = V \Lambda V^{T}$

Form the principal components:

$T = Z V$

2. Regression on Principal Components

Now, let $T_M$ be the $n \times M$ matrix of the first $M$ principal components. We regress $y$ on $T_M$ without an intercept:

$y = T_M \gamma + \varepsilon$

Where:

  • $\gamma$ is the vector of regression coefficients.
  • $\varepsilon$ is the error term.

The coefficients can be found using least squares:

$\hat{\gamma} = (T_M^{T} T_M)^{-1} T_M^{T} y$

Now, to express the model in terms of the original predictors, we can transform back using:

$\hat{\beta}_{\mathrm{PCR}} = V_M \hat{\gamma}$

Where:

  • $V_M$ is the $p \times M$ matrix of eigenvectors corresponding to the first $M$ principal components.
  • $\hat{\beta}_{\mathrm{PCR}}$ is the coefficient estimate for PCR in the original (standardized) predictor space.
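
A minimal numpy sketch of this procedure on synthetic data (the response is centered so the no-intercept regression is sensible; in practice $M$ would be chosen by cross-validation):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, M = 200, 6, 3                            # keep M = 3 components
    X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
    y = rng.normal(size=n)
    y = y - y.mean()                               # center the response

    # PCA on the standardized predictor matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals, V = np.linalg.eigh(np.cov(Z, rowvar=False))
    V = V[:, np.argsort(eigvals)[::-1]]            # columns ordered by decreasing variance

    # Regress y on the first M principal components (T_M = Z V_M).
    T_M = Z @ V[:, :M]
    gamma_hat, *_ = np.linalg.lstsq(T_M, y, rcond=None)

    # Transform back to the original (standardized) predictor space.
    beta_pcr = V[:, :M] @ gamma_hat                # beta_PCR = V_M gamma_hat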

Notes:

  • PCR trades bias for variance relative to ordinary least squares (OLS): retaining all $p$ components reproduces the OLS fit, while discarding the principal components with smaller eigenvalues (and thus explaining less variance) introduces some bias but can reduce the variance of the estimates.
  • By working with fewer components, PCR can help mitigate multicollinearity.
  • One challenge with PCR is deciding how many principal components to retain. This can often be determined via cross-validation, as sketched after these notes.
  • The principal components used in PCR are orthogonal, ensuring that multicollinearity is not an issue in the regression.
  • However, it's worth noting that PCR does not consider the response $y$ when forming the principal components, so there's no guarantee that the directions of maximum variance will also be the directions most relevant for predicting $y$.
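
A sketch of this cross-validated selection using scikit-learn (synthetic data from make_regression; the pipeline layout and the candidate grid are illustrative choices):

    from sklearn.datasets import make_regression
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

    # PCR as a pipeline: standardize, project onto principal components, regress.
    pcr = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA()),
        ("ols", LinearRegression()),
    ])

    # Pick the number of retained components by cross-validated prediction error.
    search = GridSearchCV(pcr, {"pca__n_components": range(1, 21)},
                          cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)
    best_M = search.best_params_["pca__n_components"]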

4. t-SNE (t-distributed Stochastic Neighbor Embedding)

Overview:

t-SNE is a non-linear dimensionality reduction technique that aims to preserve local structures in the data. It’s particularly useful for visualizing complex datasets where linear methods like PCA might be insufficient. The main idea behind t-SNE is to ensure that similar points in the high-dimensional space remain close in the low-dimensional embedding, while dissimilar points remain far apart.

How t-SNE works:

  1. Probabilistic Modeling in High-Dimensional Space: For every pair of points in the high-dimensional space, a conditional probability $p_{j|i}$ is computed, representing the similarity of point $x_j$ to point $x_i$. This is done using a Gaussian distribution:

$p_{j|i} = \dfrac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$

Where $\sigma_i$ is a variance specific to each point $x_i$ (chosen so that the effective number of neighbors matches the perplexity). These conditional probabilities are symmetrized as $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$.

  2. Probabilistic Modeling in Low-Dimensional Space: Similarly, for every pair of points in the low-dimensional space, a probability $q_{ij}$ is computed. This is done using a Student's t-distribution (with one degree of freedom, equivalent to the Cauchy distribution):

$q_{ij} = \dfrac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$

Here, $y_i$ and $y_j$ are the mapped points in the low-dimensional space.

  3. Minimize the Difference Between Distributions: t-SNE minimizes the divergence between the two distributions (from high-dimensional to low-dimensional space) using the Kullback-Leibler divergence:

$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \dfrac{p_{ij}}{q_{ij}}$

The optimization is typically performed using gradient descent.
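
A minimal usage sketch with scikit-learn's TSNE on the 64-dimensional digits dataset (the parameter values shown are illustrative, not recommendations):

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)            # 1,797 images, 64 features each

    # Embed into 2 dimensions; perplexity and learning rate are the main knobs.
    tsne = TSNE(n_components=2, perplexity=30.0, learning_rate=200.0,
                init="pca", random_state=0)
    embedding = tsne.fit_transform(X)              # shape (n_samples, 2)

    # The embedding can be scatter-plotted, colored by the digit labels y.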

Key Characteristics:

  1. Tuneable Parameters:
    • Perplexity: Roughly determines how many close neighbors each point considers. Typically, values between 5 and 50 are recommended.
    • Learning Rate: Determines the step size during optimization. Rates that are too high or too low may prevent convergence.

  2. Random Initialization: The low-dimensional representations are often initialized randomly, leading to potential variability between runs.

  3. Non-convexity: Due to the non-convex nature of the cost function, different initializations might lead to different embeddings. Hence, it might be advisable to run t-SNE multiple times.

  4. Preservation of Local Structure: t-SNE excels at preserving local structures but doesn't guarantee the preservation of global structures. Thus, the relative distances between clusters in the visualization may not always be meaningful.

Limitations:

  1. Computational Cost: t-SNE is computationally intensive, especially for large datasets.
  2. Interpretability: The reduced dimensions by t-SNE are not directly interpretable.
  3. No Guarantee for Reproducibility: Due to its stochastic nature and dependence on hyperparameters, t-SNE can produce different results on different runs.

Usage:

t-SNE is mainly used for data exploration and visualization. While it can technically be used for tasks like clustering, its stochastic nature makes it less suitable for such deterministic tasks.


5. Q&A

PCA (Principal Component Analysis)

  1. Q: What is the primary objective of PCA? A: The primary objective of PCA is to reduce the dimensionality of a dataset by finding a set of orthogonal axes (principal components) that capture the maximum variance in the data.

  2. Q: How are principal components ordered in PCA? A: Principal components are ordered by the amount of variance they capture from the original dataset, with the first principal component capturing the most variance.

  3. Q: Can PCA be used for data with categorical variables? A: Traditionally, PCA is designed for continuous variables. However, variations of PCA, like CATPCA or MCA (Multiple Correspondence Analysis), have been developed for categorical data.

  4. Q: How does scaling of variables impact PCA? A: Scaling is crucial in PCA. If variables have different scales, PCA might be unduly influenced by the variables with larger scales. Therefore, it's standard practice to scale (normalize) variables before applying PCA.

  5. Q: Does PCA make any assumptions about the underlying data? A: Yes, PCA assumes that the data's variables are linearly correlated and that the highest variance direction is the most important.

PCR (Principal Component Regression)

  1. Q: How does PCR differ from standard regression techniques? A: PCR first performs PCA on the predictor variables and then uses the principal components as predictors in a linear regression model, rather than the original predictors.

  2. Q: Why might one use PCR over standard linear regression? A: PCR can be helpful when predictor variables are highly collinear since it uses orthogonal principal components as predictors, which inherently removes multicollinearity.

  3. Q: Does PCR always use all the principal components for regression? A: No, often a subset of principal components is selected to achieve a balance between bias and variance.

  4. Q: How do you choose the number of principal components in PCR? A: Cross-validation is commonly used to select the optimal number of principal components in PCR to minimize prediction error.

  5. Q: Can PCR be used for both regression and classification tasks? A: While the typical use case of PCR is regression, it can be adapted for classification by applying PCA to the predictors and then using a classification algorithm on the resulting principal components.

t-SNE (t-distributed Stochastic Neighbor Embedding)

  1. Q: What is the main advantage of t-SNE over linear dimensionality reduction methods like PCA? A: t-SNE is a non-linear method that is particularly adept at preserving local structures, making it more suitable for datasets where linear methods like PCA may fail to capture intricate patterns.

  2. Q: Why might t-SNE visualizations look different across different runs? A: t-SNE has a stochastic nature and depends on hyperparameters. Due to its non-convex cost function and random initialization, different runs might produce different embeddings.

  3. Q: What is perplexity in t-SNE, and how does it influence the algorithm? A: Perplexity is a hyperparameter in t-SNE that roughly determines how many close neighbors each point considers. Adjusting the perplexity can change the balance between preserving local vs. global structures in the data.

  4. Q: Can t-SNE be used for tasks beyond visualization? A: While t-SNE is mainly used for visualization, it can technically be applied for tasks like clustering. However, its stochastic nature makes it less ideal for deterministic tasks.

  5. Q: Is the distance between clusters in a t-SNE plot indicative of their separation in the original space? A: Not necessarily. While t-SNE excels at preserving local structures, the relative distances between clusters in the visualization might not always reflect their separation in the high-dimensional original space.