Lecture 7. Linear Model Selection and Regularization
Date: 2023-03-14
1. Overview
Problems with Least Squares
- Prediction Accuracy: When $n \gg p$, the least squares estimates have low variance; when $n$ is not much larger than $p$, the variance is elevated; and when $p > n$, least squares is not viable because there is no unique solution. By limiting or moderating the estimated coefficients, we can typically reduce variance significantly while only slightly increasing bias.
- Model Interpretability: Discarding insignificant variables from the model generally enhances its clarity and understandability.
Methods for Improving Least Squares
- Subset Selection: This involves pinpointing a subset of the $p$ predictors that are believed to have a correlation with the response.
- Shrinkage: This involves fitting a model with all $p$ predictors using a technique that shrinks the coefficient estimates, pulling them closer to zero.
- Dimension Reduction: This involves projecting the $p$ predictors onto an $M$-dimensional subspace, where $M < p$. Detailed in Lecture 8.
2. Subset Selection
Subset selection methods focus on identifying a subset of the predictors that are believed to be related to the response. The objective is to identify a model that strikes a balance between fitting the data well (low bias) and having a low predictive error on new observations (low variance).
Best Subset Selection
Best Subset Selection involves identifying the "best" model for each subset size. For a dataset with $p$ predictors:
- Fit all possible models with one predictor. Choose the best one based on some criterion (e.g., lowest RSS or highest $R^2$).
- Fit all possible models with two predictors. Again, choose the best.
- Repeat until you've considered all predictors.
Out of these, a single best model is selected using a model assessment criterion, like Cross-Validated Prediction Error, AIC, BIC, or Adjusted $R^2$. However, this approach is computationally expensive, as the number of models to be fit ($2^p$) grows exponentially with the number of predictors.
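As a concrete illustration, here is a minimal Python sketch of best subset selection on a synthetic dataset (the data and the Gaussian-BIC scoring used to compare subsets are assumptions for illustration; it collapses the two stages by scoring every subset directly with BIC):

```python
import itertools

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: only the first two of six predictors actually matter.
rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

def bic(rss, n_obs, n_params):
    # Gaussian BIC up to an additive constant: n*log(RSS/n) + log(n)*d
    return n_obs * np.log(rss / n_obs) + np.log(n_obs) * n_params

best_score, best_subset = np.inf, None
for k in range(1, p + 1):                               # subset size
    for subset in itertools.combinations(range(p), k):  # all subsets of that size
        cols = list(subset)
        fit = LinearRegression().fit(X[:, cols], y)
        rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
        score = bic(rss, n, k + 1)                       # +1 for the intercept
        if score < best_score:
            best_score, best_subset = score, subset

print("Best subset by BIC:", best_subset)
```

Even with only 6 predictors this loop fits $2^6 - 1 = 63$ models, which is why the approach quickly becomes infeasible as $p$ grows.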
Stepwise Selection
Stepwise Selection is a computationally efficient alternative to best subset selection: a greedy algorithm sequentially adds or removes predictors based on some criterion until a stopping rule is reached (a sketch follows the list below).
- Forward Stepwise Selection: Starts with a model with no predictors and adds predictors one at a time, until a stopping criterion is reached.
- Backward Stepwise Selection: Starts with the full model and removes predictors one at a time based on some criterion.
- Hybrid Stepwise Selection: A combination of forward and backward stepwise selection.
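A minimal sketch of forward stepwise selection, here using scikit-learn's `SequentialFeatureSelector` as a stand-in for the procedure described above (the greedy search is scored by cross-validated fit rather than raw RSS, and the synthetic data and the choice of three selected features are assumptions):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data in which only predictors 0 and 3 are relevant.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=200)

# Start from the empty model and greedily add the predictor that most
# improves 5-fold cross-validated performance, stopping at 3 predictors.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
sfs.fit(X, y)
print("Selected predictor indices:", np.flatnonzero(sfs.get_support()))
```

Setting `direction="backward"` gives backward stepwise selection starting from the full model.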
Choosing the Optimal Model
Selecting the right model often requires balancing complexity (number of predictors) with the model's goodness-of-fit:
- Residual Sum of Squares (RSS) and $R^2$: Typically, as more predictors are added, the RSS decreases and $R^2$ increases, so these alone cannot be used to compare models with different numbers of predictors.
- AIC, BIC, and Adjusted $R^2$: These introduce a penalty for each predictor added to the model and can be used to identify which model is best (formulas are given after this list).
- Cross-Validation: It provides a direct estimate of the test error and can be used to identify the model that would perform best on unseen data.
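For reference, one common way to write these criteria (a sketch of the standard definitions; here $k$ is the number of estimated parameters, $d$ the number of predictors, $\hat{L}$ the maximized likelihood, $n$ the number of observations, and RSS/TSS the residual/total sum of squares):

$$\text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln(n) - 2\ln\hat{L}, \qquad \text{Adjusted } R^2 = 1 - \frac{\text{RSS}/(n - d - 1)}{\text{TSS}/(n - 1)}$$

Lower AIC and BIC are better, while higher Adjusted $R^2$ is better. For a Gaussian linear model, $-2\ln\hat{L}$ equals $n\ln(\text{RSS}/n)$ up to an additive constant, so both AIC and BIC reward a low RSS while charging for each added parameter; BIC penalizes complexity more heavily than AIC whenever $\ln(n) > 2$ (i.e., $n \geq 8$).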
3. Shrinkage Methods
Shrinkage methods, often called regularization methods, constrain or regularize the coefficient estimates towards zero. By doing so, they can reduce the variance of predictions and provide better model interpretability. Two common shrinkage methods are Ridge Regression and the Lasso.
Ridge Regression
Ridge Regression introduces a penalty term to the least squares objective that is proportional to the squared $\ell_2$ norm (Euclidean norm) of the coefficients.
The ridge regression estimates, $\hat{\beta}^R$, are the values that minimize:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \;=\; \text{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$$

Where:
- $\text{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$ is the Ridge Regression objective.
- $\lambda \geq 0$ is a tuning parameter that determines the amount of shrinkage. As $\lambda$ increases, the penalty for larger coefficients grows, pushing the estimates towards zero. When $\lambda = 0$, Ridge Regression becomes identical to least squares linear regression.
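As a quick illustration of the shrinkage effect, here is a minimal Python sketch (the synthetic data and the grid of $\lambda$ values are assumptions; note that scikit-learn calls the tuning parameter `alpha`):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with two informative predictors out of five.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)

# Coefficients shrink towards zero as lambda grows, but in general
# they never become exactly zero.
for lam in [0.01, 1.0, 10.0, 100.0]:
    coef = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:7.2f}  coefficients={np.round(coef, 3)}")
```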
The Lasso
The Lasso, short for Least Absolute Shrinkage and Selection Operator, is similar to Ridge Regression but uses the $\ell_1$ norm (Manhattan norm) for regularization. A key feature of the Lasso is that it can force some coefficients to be exactly zero, effectively selecting a simpler model.
The Lasso estimates, $\hat{\beta}^L$, are the values that minimize:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \;=\; \text{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|$$

Where:
- $\text{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|$ is the Lasso objective.
- $\lambda \geq 0$ is a tuning parameter similar to Ridge Regression. A larger value of $\lambda$ will push more coefficients to be exactly zero.
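A companion sketch for the Lasso, showing that coefficients are driven to exactly zero as $\lambda$ grows (the data and $\lambda$ values are again illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Same kind of synthetic data as in the ridge sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)

# The L1 penalty sets weak coefficients exactly to zero, so larger
# lambda values yield sparser models.
for lam in [0.01, 0.1, 1.0, 5.0]:
    coef = Lasso(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:5.2f}  non-zero={int(np.sum(coef != 0))}  coefficients={np.round(coef, 3)}")
```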
Lp Regression
The idea behind $\ell_p$ regression is to generalize Ridge and Lasso regression by introducing a penalty term proportional to the $\ell_p$ norm of the coefficients (raised to the $p$-th power). This encompasses a range of models, including Ridge and Lasso as special cases when $p = 2$ and $p = 1$, respectively.
The $\ell_p$ regression estimates, $\hat{\beta}^{(p)}$, are the values that minimize:

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j}|\beta_j|^p \;=\; \text{RSS} + \lambda\sum_{j}|\beta_j|^p$$

Where:
- $\text{RSS} + \lambda\sum_{j}|\beta_j|^p$ is the $\ell_p$ regression objective.
- $p$ can take any non-negative value (here $p$ is the norm exponent, not the number of predictors).
- $\lambda$ is a tuning parameter that determines the amount of shrinkage applied to the coefficients.
p | Common Name | Norm Ball Shape | Special Usage/Meaning | Note |
---|---|---|---|---|
0 | "Zero norm" (L0) | Union of the coordinate axes (not a true ball) | Counts non-zero elements; encourages sparsity | Not a true norm, but used to denote sparsity |
1 | Lasso (L1 norm) | Rhomboid (diamond in 2D) | Produces sparse solutions | Computationally tractable sparsity |
2 | Ridge (L2 norm) | Circle in 2D, sphere in 3D, etc. | Shrinks coefficients uniformly | Euclidean norm, smooth solution |
3 | L3 norm | Shape between the sphere and the diamond | Rarely used | |
... | ... | ... | ... | ... |
∞ | Max norm (L∞ norm) | Hyperrectangle (box in 2D) | Bounds the absolute value of each element | Limits the maximum value of any element of the vector |
Selecting the Tuning Parameter
The tuning parameter, often represented by $\lambda$, determines the level of regularization in methods like Ridge and Lasso. Proper selection can notably impact a model's performance.
Cross-Validation
- Use k-fold cross-validation: Divide the data into $k$ parts, train on $k-1$ parts, and validate on the remaining part. The optimal $\lambda$ is the one that gives the lowest average validation error.
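A minimal sketch of this, using scikit-learn's `RidgeCV` to pick $\lambda$ from a log-spaced grid by 5-fold cross-validation (the grid and the synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=150)

# Evaluate each candidate lambda with 5-fold cross-validation and keep
# the one with the lowest average validation error.
lambdas = np.logspace(-3, 3, 25)
model = RidgeCV(alphas=lambdas, cv=5).fit(X, y)
print("Selected lambda:", model.alpha_)
```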
Regularization Path Algorithms
- For techniques like the Lasso, algorithms such as least angle regression (LARS) can compute the model over a whole range of $\lambda$ values, aiding in the selection process.
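For instance, scikit-learn's `lasso_path` computes the entire coefficient path in one call (it uses coordinate descent; `lars_path` with `method="lasso"` is the LARS-based alternative). A sketch with assumed synthetic data:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = 3 * X[:, 0] - X[:, 2] + rng.normal(size=120)

# Coefficients for an automatically chosen, decreasing grid of 50 lambdas.
lambdas, coefs, _ = lasso_path(X, y, n_alphas=50)
print("Number of lambda values:", lambdas.shape[0])
print("Coefficient path shape:", coefs.shape)  # (n_features, n_lambdas)
```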
Information Criteria
- Use measures like the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to assess model fit and guide the choice of $\lambda$.
Tips:
- Always standardize predictors before regularization.
- Initially use a broad range for $\lambda$ in cross-validation, then narrow down (see the sketch after this list).
- Be mindful of computational costs in large datasets.
- Avoid over-tuning to prevent overfitting.
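A sketch combining the first two tips, assuming scikit-learn's `Pipeline`, `StandardScaler`, and `GridSearchCV` (the data, grids, and the factor-of-three window around the best value are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) * rng.uniform(0.1, 10.0, size=8)
y = 0.1 * X[:, 0] + rng.normal(size=200)

# Standardize inside the pipeline so the scaler is re-fit within each CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(max_iter=10_000))])

# Broad log-spaced search first, then a narrower grid around the best value.
broad = GridSearchCV(pipe, {"lasso__alpha": np.logspace(-4, 2, 13)}, cv=5).fit(X, y)
best = broad.best_params_["lasso__alpha"]
narrow = GridSearchCV(pipe, {"lasso__alpha": np.linspace(best / 3, best * 3, 9)}, cv=5).fit(X, y)
print("Selected lambda:", narrow.best_params_["lasso__alpha"])
```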
Elastic Net
Elastic Net regression is a linear regression model trained with both L1 and L2 regularization of the coefficients. It combines the properties of both Ridge Regression (L2 regularization) and Lasso Regression (L1 regularization).
Formulation:
The cost function to minimize for Elastic Net is:

$$J(\beta) = \frac{1}{2m}\lVert y - X\beta \rVert_2^2 + \alpha\left(\rho\,\lVert\beta\rVert_1 + \frac{1-\rho}{2}\lVert\beta\rVert_2^2\right)$$

Where:
- $J(\beta)$ is the cost function.
- $y$ is the observed output.
- $X$ is the input matrix.
- $\beta$ is the vector of model parameters.
- $m$ is the number of samples.
- $\alpha$ is a non-negative hyperparameter that scales the overall influence of the regularization term.
- $\rho$ (the L1 ratio) is a hyperparameter that controls the mixing between L1 and L2 regularization. When $\rho = 1$, Elastic Net is equivalent to Lasso regression; when $\rho = 0$, it's equivalent to Ridge regression.
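A minimal sketch with scikit-learn's `ElasticNet`, whose `alpha` and `l1_ratio` parameters play the roles of $\alpha$ and $\rho$ above (the correlated synthetic data and the particular parameter values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Two highly correlated informative predictors plus three noise predictors.
rng = np.random.default_rng(0)
z = rng.normal(size=200)
X = np.column_stack([
    z + 0.05 * rng.normal(size=200),
    z + 0.05 * rng.normal(size=200),
    rng.normal(size=(200, 3)),
])
y = 2 * z + rng.normal(size=200)

# l1_ratio near 1 behaves like the Lasso (sparse solutions); near 0,
# like Ridge (the correlated predictors tend to share coefficient weight).
for rho in [0.1, 0.5, 0.9]:
    coef = ElasticNet(alpha=0.5, l1_ratio=rho).fit(X, y).coef_
    print(f"l1_ratio={rho}  coefficients={np.round(coef, 3)}")
```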
Characteristics:
- Variable Selection & Regularization: Elastic Net incorporates the strengths of both Lasso and Ridge regression. It can remove weak variables entirely (like Lasso) and also regularize correlated predictors by distributing coefficients among them (like Ridge).
- Overfitting: By adding regularization, Elastic Net helps prevent the model from overfitting the training data.
- Multicollinearity: Elastic Net is particularly useful when dealing with datasets that exhibit multicollinearity.
When to Use:
- When there's a reason to believe that many features are irrelevant or redundant. Elastic Net will perform feature selection like Lasso but also handle multicollinearity between variables better due to its Ridge component.
- When the number of predictors is greater than the number of observations ($p > n$).
Caveats:
While Elastic Net can be a powerful tool, it's crucial to tune its hyperparameters correctly. Using cross-validation to find the best combination of `alpha` and `l1_ratio` is often recommended. Additionally, as with other regularized regression methods, feature scaling is essential for optimal performance.
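A sketch of that tuning step with scikit-learn's `ElasticNetCV`, which cross-validates over an automatic `alpha` grid and a user-supplied list of `l1_ratio` values (the candidate list and synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - X[:, 1] + rng.normal(size=200)

# Scale the features, then jointly search l1_ratio and alpha with 5-fold CV.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5),
).fit(X, y)

fitted = model.named_steps["elasticnetcv"]
print("Best l1_ratio:", fitted.l1_ratio_)
print("Best alpha:", fitted.alpha_)
```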
4. Q&A
1. What is subset selection in the context of linear regression?
Subset selection involves identifying and including only a subset of predictors in a linear regression model, aiming to improve the model's interpretability and performance by excluding irrelevant predictors.
2. What is the primary goal of best subset selection?
The primary goal of best subset selection is to identify the best model (in terms of prediction accuracy) for each possible number of predictors by evaluating all possible combinations.
3. How does forward stepwise selection differ from backward stepwise selection?
Forward stepwise selection starts with no predictors and adds one predictor at a time, whereas backward stepwise selection starts with all predictors and removes one predictor at a time.
4. Why might we prefer stepwise selection over best subset selection?
Stepwise selection is computationally less expensive than best subset selection, especially when dealing with a large number of predictors.
5. How is the "best" model typically determined in subset selection methods?
The "best" model is often determined using criteria like Adjusted R-squared, AIC, BIC, or cross-validation error.
6. Why is it not always ideal to include all predictors in a regression model?
Including all predictors can lead to overfitting, reduced model interpretability, and multicollinearity issues.
7. Are there any risks associated with aggressive subset selection?
Yes, aggressive subset selection can lead to underfitting, where the model may miss important relationships between predictors and the response.
8. How can cross-validation be used in the context of subset selection?
Cross-validation can help in assessing the prediction performance of different subsets of predictors and guide the selection of the most suitable subset.
9. In what scenarios might best subset selection become impractical?
Best subset selection can become impractical when the number of predictors is large, due to the exponential growth in possible models to evaluate.
10. What role does multicollinearity play in the subset selection process?
Multicollinearity can make it difficult to identify the best subset because highly correlated predictors can be interchangeably significant. Proper subset selection can help in addressing multicollinearity by excluding redundant predictors.
11. What are shrinkage methods in the context of regression?
Shrinkage methods, also known as regularization techniques, introduce a penalty to the regression, which "shrinks" or constrains the coefficient estimates towards zero or each other.
12. Why might one prefer ridge regression over traditional least squares regression?
Ridge regression can be preferable when there's multicollinearity among predictors or when the model risks overfitting. By adding a penalty term, ridge regression reduces the model's complexity and can improve prediction accuracy.
13. How does the Lasso differ from ridge regression in terms of coefficient shrinkage?
While both methods shrink coefficients, Lasso can shrink some coefficients to exactly zero, effectively performing variable selection, whereas ridge regression only reduces the magnitude of coefficients but doesn't set them to zero.
14. What's the significance of the tuning parameter (often denoted as λ or alpha) in shrinkage methods?
The tuning parameter controls the strength of the penalty in shrinkage methods. A value of zero means no penalty (ordinary least squares), while a larger value increases the amount of shrinkage.
15. How do you determine the optimal value for the tuning parameter?
The optimal value for the tuning parameter is typically determined using cross-validation, where the value that results in the lowest test error rate is chosen.
16. What is the primary advantage of using Lasso regression?
The primary advantage of Lasso regression is its ability to perform both regularization and variable selection simultaneously.
17. In the context of shrinkage, what is the "bias-variance trade-off"?
The bias-variance trade-off refers to the balance between a model's flexibility (which drives variance) and the error introduced by its simplifying assumptions (bias). As we increase the penalty term (using a higher tuning parameter), the variance of the estimates decreases but the bias increases, and vice versa.
18. How does elastic net combine ridge and Lasso regression?
Elastic net combines the penalties of ridge and Lasso regression, allowing for both types of shrinkage and variable selection. It's especially useful when there are many correlated predictors.
19. Why might shrinkage methods be preferable when $p > n$ (more predictors than observations)?
In situations where $p > n$, traditional least squares regression isn't feasible because there is no unique coefficient estimate. Shrinkage methods can still produce coefficient estimates in such scenarios by imposing penalties and reducing the risk of overfitting.
20. What's the potential downside of using a very high penalty term in ridge regression?
A very high penalty term can overly constrain the coefficients, leading to a model that's too simple and potentially missing relevant relationships between predictors and the response.