Lecture 2. Linear Regression
Date: 2023-02-07
1. Maximum Likelihood Estimation (MLE)
Introduction
Maximum Likelihood Estimation (MLE) is a method used to estimate the parameters of a statistical model. Given a set of data and a statistical model, MLE finds the parameter values that make the observed data most probable under the assumed model. The idea is to choose the parameters that maximize the likelihood function, which measures how well the model explains the observed data.
Likelihood vs. Probability
Probability:
- Refers to the measure of the chance of an event occurring.
- It's usually defined on a sample space.
- For a given event, probabilities always lie between 0 and 1, inclusive.
- The sum of probabilities for all possible outcomes (in a well-defined space) is always 1.
- Example: If you flip a fair coin, the probability of getting heads is 0.5.
Likelihood:
- Relates to how well your data support a particular parameter value for a statistical model.
- It’s a function of the parameters given observed data.
- Likelihood values aren’t bounded between 0 and 1.
- They don’t sum up to 1 and can be any non-negative value.
- Likelihoods are commonly used in the context of parameter estimation, especially with Maximum Likelihood Estimation (MLE).
- Example: Suppose you have a binomial model. The likelihood would express how plausible different values of the binomial probability parameter are, given the observed data.
In simple terms:
- You'd use probability to describe how likely a future event is to happen based on a model.
- You'd use likelihood to describe how plausible a particular model is given some observed data.
Algorithm
For a given data set $x_1, x_2, \dots, x_n$ and a probability distribution $f(x \mid \theta)$ parameterized by $\theta$, the likelihood function is given by:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

For computational ease, we often work with the log-likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

The MLE is the value of $\theta$ that maximizes the (log-)likelihood:

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \, \ell(\theta)$$
This is typically found using optimization techniques.
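To make the "optimization techniques" step concrete, here is a minimal sketch (not from the lecture) that finds the MLE of a Bernoulli success probability numerically with `scipy.optimize.minimize_scalar`; the coin-flip data and the choice of model are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data: 10 coin flips (1 = heads), assumed to be Bernoulli(p).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def neg_log_likelihood(p):
    # Negative log-likelihood of a Bernoulli(p) sample.
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Maximize the log-likelihood by minimizing its negative over 0 < p < 1.
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numeric MLE:", res.x)                 # close to 0.7
print("closed form (sample mean):", x.mean())
```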
Example
Suppose we have a sample $x_1, x_2, \dots, x_n$ from a normal distribution $\mathcal{N}(\mu, \sigma^2)$, and we want to estimate the mean $\mu$ and variance $\sigma^2$ using MLE.

For a single observation $x_i$, the probability density function of the normal distribution is:

$$f(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

The closer the value of $x_i$ is to $\mu$, the higher the probability density.

Let's first put it in the likelihood form:

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

For computational ease, we'll work with the log-likelihood:

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

To find the MLEs $\hat{\mu}$ and $\hat{\sigma}^2$, we differentiate the log-likelihood w.r.t. $\mu$ and $\sigma^2$, set the derivatives to zero, and solve; this is how we maximize the log-likelihood w.r.t. $\mu$ and $\sigma^2$. This yields:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$$

Thus, the MLE for the mean is the sample mean, and the MLE for the variance is the sample variance with divisor $n$ (not the unbiased version with divisor $n-1$).
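As a quick numerical check of these closed-form results (a sketch on synthetic data, not part of the lecture), note that the MLE of the variance uses the divisor $n$, which is what `np.var` computes by default (`ddof=0`):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)   # synthetic sample

mu_hat = x.mean()                                # MLE of the mean: sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)          # MLE of the variance: divisor n

print(mu_hat, sigma2_hat)
print(np.isclose(sigma2_hat, x.var(ddof=0)))     # np.var with ddof=0 matches the MLE
```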
2. Simple Linear Regression
Introduction
Simple Linear Regression is a linear approach to modeling the relationship between a dependent variable $y$ and one independent variable $x$. It assumes a linear relationship between them. The goal is to find the best-fitting straight line that predicts $y$ from $x$.
The model is represented as:

$$y = \beta_0 + \beta_1 x + \epsilon$$

Where:
- $y$ is the dependent variable.
- $x$ is the independent variable.
- $\beta_0$ is the intercept.
- $\beta_1$ is the slope of the line.
- $\epsilon$ represents the error term (residuals).
Algorithm
The aim is to find $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squared residuals. Using the method of ordinary least squares, the estimators for the slope and intercept are given by:

$$\hat{\beta}_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad \hat{\beta}_0 = \frac{\sum y_i - \hat{\beta}_1 \sum x_i}{n} = \bar{y} - \hat{\beta}_1 \bar{x}$$

Where:
- $\sum x_i$ is the sum of all x-values.
- $\sum y_i$ is the sum of all y-values.
- $\sum x_i y_i$ is the sum of the product of each pair of x and y values.
- $\sum x_i^2$ is the sum of the squared x-values.
- $n$ is the number of observations.
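As a sanity check, the formulas above can be implemented directly from the running sums; this is a minimal sketch on made-up hours/score data (the numbers are illustrative, not the lecture's data set):

```python
import numpy as np

# Illustrative data: hours studied (x) and exam scores (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([33.0, 33.0, 37.0, 38.0, 41.0])

n = len(x)
sum_x, sum_y = x.sum(), y.sum()
sum_xy, sum_x2 = (x * y).sum(), (x ** 2).sum()

beta1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
beta0 = (sum_y - beta1 * sum_x) / n      # equivalently y.mean() - beta1 * x.mean()

print(beta0, beta1)
# Cross-check against numpy's least-squares polynomial fit (returns [slope, intercept]).
print(np.polyfit(x, y, deg=1))
```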
Example
Suppose we have data on the number of hours studied (x) and the exam scores (y) for a group of students. We want to predict exam scores based on the number of hours studied.
Let's say from the given data we have computed $\sum x_i$ (total hours), $\sum y_i$ (total scores), $\sum x_i y_i$, $\sum x_i^2$, and $n$ (the number of students).
Plugging these values into our formulas yields $\hat{\beta}_0 = 30$ and $\hat{\beta}_1 = 2$.
The linear regression model becomes:

$$\hat{y} = 30 + 2x$$
This suggests that for every additional hour studied, the exam score increases by 2 points, starting from a base score of 30 points when no hours are studied.
Metrics of Fit for Linear Regression
Linear regression model evaluation involves two main facets: assessing the significance of individual predictors and evaluating the overall model's performance. Here's a breakdown of commonly used metrics for both aspects:
Metrics for Assessing Significance of Individual Predictors
Metric/Test | Formula (or Definition) | Rationale | Range | Interpretation | Pros | Cons | When to Use |
---|---|---|---|---|---|---|---|
P-value for a Regression Coefficient | Based on the t-distribution: $t = \hat{\beta}_j / \mathrm{SE}(\hat{\beta}_j)$ | Checks statistical significance of predictors. | 0 to 1 | Smaller p-value (often <0.05) suggests a significant predictor. | Standardized measure; widely recognized. | Can be misleading; doesn't measure effect size. | Check significance of individual predictor relationships. |
- An estimator is unbiased if, on average, it correctly estimates the parameter it's supposed to estimate. An estimator is consistent if it converges to the true parameter value as the sample size increases.
Metrics for Assessing Overall Model Fit:
Metric/Test | Formula (or Definition) | Rationale | Range | Interpretation | Pros | Cons | When to Use |
---|---|---|---|---|---|---|---|
$R^2$ (Coefficient of Determination) | $R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}$ | Quantifies variance captured by the predictors collectively. | 0 to 1 | Proportion of variance in dependent variable explained by the model. Higher is generally better. | Easy to interpret; values between 0 and 1. | Can be overly optimistic; doesn't confirm causal relationship. | General sense of overall model fit. |
F-test | $F = \dfrac{(SS_{tot} - SS_{res})/p}{SS_{res}/(n - p - 1)}$ | Compares the model's fit to data against a no-predictor model. | 0 to ∞ | Larger F value suggests a significant model overall. | Useful for comparing models; holistic measure. | In simple regression, tests a similar hypothesis as the t-test for overall model significance. | Comparing nested models or in multiple regression for overall significance. |
Residual Analysis | Residual = Observed value - Predicted value | Validates model assumptions by analyzing overall error patterns. | Varies based on data | Patternless residuals suggest model fits the overall data structure; look for trends or outliers in plots. | Provides holistic insights into fit, outliers, assumptions. | Requires subjective judgment. | Throughout the modeling process to validate overall model fit. |
RMSE | $\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | Measures the average magnitude of residuals for the entire model. | 0 to ∞ | Lower RMSE indicates better overall fit; gives error magnitude in units of dependent variable. | Absolute measure of fit; units of dependent variable. | Sensitive to outliers. | To gauge the overall magnitude of model prediction errors. |
Note: For simple linear regression, R^2 and the square of the Pearson correlation coefficient are equivalent. The Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
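This equivalence is easy to verify numerically; the following sketch (synthetic data) compares $R^2$ from a simple regression with the squared Pearson correlation returned by `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=4.0, size=200)   # synthetic linear data

# Fit simple linear regression and compute R^2 = 1 - SS_res / SS_tot.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient
print(r_squared, r ** 2)       # the two agree for simple linear regression
```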
3. Multiple Linear Regression
Introduction
Multiple Linear Regression (MLR) is an extension of simple linear regression that uses more than one independent variable to predict a dependent variable. In essence, while simple linear regression tries to draw a straight line fit between two variables, MLR aims to fit a line through a multi-dimensional space of data points.
The main idea behind MLR is that more than one predictor variable or feature can be used to predict a response. For instance, in predicting house prices, instead of just using the house size as a predictor, we can also include other features like the number of bedrooms, the age of the house, proximity to public transport, and so on. This way, MLR provides a more holistic approach to predicting the response variable.
The equation for MLR is usually represented as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

Where:
- $y$ is the dependent variable.
- $\beta_0$ is the y-intercept.
- $\beta_1, \beta_2, \dots, \beta_p$ are the coefficients of the predictor variables.
- $x_1, x_2, \dots, x_p$ are the predictor variables.
- $\epsilon$ represents the error term.
Algorithm
Step 1: Model Specification - Choose the form of the equation (e.g., which predictors to include). - Check for any assumptions of linear regression: linearity, independence, homoscedasticity (constant variance of errors), and normality of errors.
Step 2: Parameter Estimation - Using methods like Least Squares, calculate the coefficients $\beta_0, \beta_1, \dots, \beta_p$.
Step 3: Prediction & Interpretation - Use the model to make predictions on new data. - Interpret the coefficients to understand the relationship between predictors and the response.
Step 4: Model Assessment - Evaluate the model fit using metrics like $R^2$, adjusted $R^2$, and RMSE. - Test the significance of the predictors using the F-test and t-tests.
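The four steps can be walked through with `statsmodels` on synthetic data; the variable names (`size`, `age`, `price`) are illustrative assumptions, not part of the lecture:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Step 1: specify a model with two predictors (synthetic housing-style data).
rng = np.random.default_rng(2)
df = pd.DataFrame({"size": rng.uniform(50, 250, 300),
                   "age": rng.uniform(0, 40, 300)})
df["price"] = 50 + 1.2 * df["size"] - 0.8 * df["age"] + rng.normal(scale=20, size=300)

# Step 2: estimate the coefficients by least squares.
model = smf.ols("price ~ size + age", data=df).fit()

# Step 3: predict on new data and interpret the coefficients.
new = pd.DataFrame({"size": [120.0], "age": [10.0]})
print(model.params)                          # intercept and slope estimates
print(model.predict(new))

# Step 4: assess the fit and the significance of the predictors.
print(model.rsquared, model.rsquared_adj)    # R^2, adjusted R^2
print(model.fvalue, model.f_pvalue)          # overall F-test
print(model.pvalues)                         # t-test p-values per coefficient
```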
Some Important Questions
When we perform multiple linear regression, we usually are interested in answering a few important questions.
- Is at least one of the predictors useful in predicting the response?
When we add predictors to a regression model, we want to be sure that they have a meaningful relationship with the response variable. An overall test can help us with this.
The F-statistic is a useful metric here, especially when the number of predictors is small. It compares the full model with all predictors to a model with no predictors. If the F-statistic is significantly large, it suggests that at least one of the predictors is useful.
The null hypothesis for the F-test is that none of the predictors are related to the response, against the alternative that at least one is useful. If the p-value associated with the F-statistic is small (typically <0.05), it suggests rejecting the null, meaning that at least one predictor is valuable.
- Do all the predictors help to explain $y$, or is only a subset of the predictors useful?
A fundamental challenge in multiple regression is deciding which predictors should be included in the model. With $p$ predictors, there are $2^p$ potential models. Checking all possible models is impractical when $p$ is large.
Instead of trying out all possible models, we can use criteria like AIC, BIC, and adjusted $R^2$ to select the best subset of predictors.
Variable selection methods (a sketch of forward selection follows below):
- Forward Selection: Start with a null model (no predictors) and add predictors one-by-one, at each step adding the predictor that gives the most significant improvement in fit.
- Backward Selection: Start with all predictors and remove them one-by-one, at each step removing the predictor that has the least impact on fit.
- Mixed Selection: A combination of forward and backward selection. Begin with no predictors and add predictors as in forward selection, but after adding a predictor, check the model for any predictors that no longer provide a good fit and remove them.
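A minimal sketch of forward selection, using adjusted $R^2$ as the criterion (AIC or BIC could be substituted); it assumes a pandas DataFrame `df` whose response column is named `y`:

```python
import statsmodels.formula.api as smf

def forward_selection(df, response="y"):
    """Greedy forward selection using adjusted R^2 (illustrative sketch)."""
    remaining = [c for c in df.columns if c != response]
    selected, best_score = [], float("-inf")
    while remaining:
        # Score every candidate model that adds one more predictor.
        scores = []
        for cand in remaining:
            formula = f"{response} ~ " + " + ".join(selected + [cand])
            scores.append((smf.ols(formula, data=df).fit().rsquared_adj, cand))
        score, best = max(scores)
        if score <= best_score:        # stop when no candidate improves the fit
            break
        best_score = score
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage (assuming df holds numeric predictor columns and a response column "y"):
# print(forward_selection(df))
```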
- How well does the model fit the data?
Model fit can be assessed using various metrics. The Residual Standard Error (RSE) gives a measure of the typical difference between the observed and predicted values. The $R^2$ statistic provides the proportion of variance in the response that's explained by the predictors. Adjusted $R^2$ is a modified version of $R^2$ that penalizes the inclusion of unnecessary predictors, making it more useful when comparing models with different numbers of predictors.
- Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Once a model is built, predictions can be made for new data. However, it's crucial to assess the accuracy of these predictions.
- Confidence Interval: Gives a range for the mean response value at a particular predictor value, taking into account the uncertainty of the model coefficients.
- Prediction Interval: Provides a range for an individual prediction, considering both the uncertainty of the model coefficients and the natural variability in the response.
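With `statsmodels`, both intervals are available from a fitted model's `get_prediction` method; this is a sketch on synthetic data, and the 95% level is just an illustrative choice:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(scale=2.0, size=100)   # synthetic data

model = smf.ols("y ~ x", data=df).fit()
new = pd.DataFrame({"x": [5.0]})

frame = model.get_prediction(new).summary_frame(alpha=0.05)       # 95% intervals
# mean_ci_lower / mean_ci_upper -> confidence interval for the mean response at x = 5
# obs_ci_lower  / obs_ci_upper  -> prediction interval for an individual response
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```

The prediction interval is always at least as wide as the confidence interval, since it adds the irreducible error variability to the uncertainty in the estimated mean response.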
Other Considerations
Categorical Predictors
- Single categorical predictor with two levels:
  - These are binary or dichotomous variables. One category is often coded as "1" and the other as "0", commonly referred to as dummy coding. For example, in modeling the effect of gender on a response, males might be coded as "1" and females as "0".
- Single categorical predictor with more than two levels:
  - This involves the use of multiple dummy variables. For a predictor with three levels (A, B, C), two dummy variables might be created. For example, A might be represented as "00", B as "10", and C as "01".
- Multiple categorical predictors:
  - Each categorical predictor is broken down into its dummy variables. The interpretation becomes a comparison to the reference category for each predictor.
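In practice the dummy coding is usually automated; a sketch using `pandas.get_dummies` (the column names and categories are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"region": ["A", "B", "C", "A", "C"],
                   "gender": ["M", "F", "F", "M", "F"]})

# drop_first=True keeps one reference level per predictor to avoid redundant dummies.
dummies = pd.get_dummies(df, columns=["region", "gender"], drop_first=True)
print(dummies)   # indicator columns region_B, region_C, gender_M
```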
Collinearity
- Collinearity refers to a situation in which two or more predictor variables are closely related to one another. This can make it difficult to identify the individual effect of predictors on the response. Tools like Variance Inflation Factor (VIF) can be used to detect multicollinearity.
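A sketch of computing VIF with `statsmodels` on deliberately collinear synthetic predictors (a VIF above roughly 5-10 is a common rule of thumb for concern):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)     # nearly collinear with x1
x3 = rng.normal(size=200)                     # unrelated predictor
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the other predictors.
for j, name in enumerate(X.columns):
    if name != "const":                       # the intercept's VIF is not of interest
        print(name, variance_inflation_factor(X.values, j))
```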
Correlation of error terms
- Ideally, the error terms in a regression model should be uncorrelated. If they are correlated, it might suggest that the model is missing some information. Time series data often faces this issue, where errors in one period are correlated with errors in another.
Think of time series data, like stock prices. If a stock price is higher than predicted today (positive error), it might be higher than predicted tomorrow as well due to momentum or some other market factor. This introduces a correlation between errors of subsequent time points.
A plot of residuals over time can help in identifying this. If the residuals track over time (i.e., exhibit a pattern, such as consistently remaining positive or going up and down in a predictable rhythm), this suggests correlation.
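Besides a residuals-over-time plot, the Durbin-Watson statistic is a common numeric check for first-order autocorrelation; the sketch below builds synthetic data with AR(1)-style errors purely for illustration:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
n = 300
t = np.arange(n, dtype=float)

# Build errors that are correlated over time, then a trending response.
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.8 * e[i - 1] + rng.normal(scale=1.0)
y = 2.0 + 0.5 * t + e

X = sm.add_constant(t)
resid = sm.OLS(y, X).fit().resid

# Values near 2 suggest no autocorrelation; values well below 2 suggest positive
# autocorrelation (as here), and values well above 2 suggest negative autocorrelation.
print(durbin_watson(resid))
```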
Non-constant variance of error terms
- The assumption that the variance of the error terms is constant (homoscedasticity) is crucial for linear regression. Heteroscedasticity can lead to inefficient parameter estimates, meaning the estimated coefficients are not as precise as they could be. Moreover, the standard errors can be biased, leading to unreliable hypothesis tests.
For example, if the variance of the error terms increases with the predictor variable, the model might overestimate the precision of the parameter estimates: ordinary least squares weights every observation equally, so the noisy high-variance points exert an outsized influence on the fit while the usual standard errors understate the true uncertainty.
We can address heteroscedasticity by transforming the response variable, using weighted least squares, or adding missing predictor variables that explain the changing variance.
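A sketch of weighted least squares with `statsmodels`, for synthetic data whose error spread grows with the predictor; the weight choice $1/x^2$ assumes the error standard deviation is roughly proportional to $x$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, 300)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)   # error spread grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Weight each observation by the inverse of its (assumed) error variance.
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()

print(ols.params, ols.bse)   # coefficient estimates and standard errors
print(wls.params, wls.bse)   # typically more precise under heteroscedasticity
```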
Outliers
- Outliers are data points that don't follow the pattern of the rest of your data. They can unduly influence the regression line and the predictions. Diagnostic tools like residual plots can help in identifying outliers.
High-leverage points
- These are data points that have unusual predictor values. They might not necessarily impact the model fit, but they can influence the regression line significantly due to their extreme x-values. Leverage is measured by the hat values (the diagonal of the hat matrix), and Cook's distance, which combines leverage with residual size, is a common measure for detecting influential points.
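A sketch of computing leverage (hat values) and Cook's distance with `statsmodels`; one synthetic point is given an extreme x-value so that it stands out:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = np.append(rng.uniform(0, 10, 50), 40.0)     # last point has an extreme x-value
y = np.append(1.0 + 2.0 * x[:50] + rng.normal(scale=2.0, size=50), 120.0)

results = sm.OLS(y, sm.add_constant(x)).fit()
influence = results.get_influence()

leverage = influence.hat_matrix_diag            # high values = high leverage
cooks_d, _ = influence.cooks_distance           # high values = influential points

print(np.argmax(leverage), leverage.max())      # the extreme-x point stands out
print(np.argmax(cooks_d), cooks_d.max())
```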
Non-linear Relationships
- Interaction terms:
  - These capture the effect on the response when two predictors interact with each other. For instance, the impact of advertising on sales might depend on the region of advertising. An interaction term between advertising and region can capture this.
- Polynomial terms:
  - Sometimes, the relationship between predictors and the response is not linear but curvilinear. By adding squared (or even higher-degree) terms of the predictor to the model, these non-linear relationships can be captured.
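Both ideas can be expressed directly in a `statsmodels` formula; the advertising-style variable names below are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
df = pd.DataFrame({"tv": rng.uniform(0, 100, 400),
                   "radio": rng.uniform(0, 50, 400)})
df["sales"] = (5 + 0.05 * df["tv"] + 0.1 * df["radio"]
               + 0.02 * df["tv"] * df["radio"]          # interaction effect
               - 0.002 * df["tv"] ** 2                  # curvature in tv
               + rng.normal(scale=2.0, size=400))

# tv:radio adds the interaction term; I(tv**2) adds a quadratic (polynomial) term.
model = smf.ols("sales ~ tv + radio + tv:radio + I(tv**2)", data=df).fit()
print(model.params)
```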
Summary
Consideration | Problem | Diagnosis | Solution |
---|---|---|---|
Categorical Predictors | - | - | - |
Single categorical predictor with two levels | Need to encode binary categories for modeling. | Use of dummy coding. | Encode one category as "1" and the other as "0". |
Single categorical predictor with more than two levels | Need to encode multi-category variables. | Use of multiple dummy variables. | For three levels (A, B, C), use two dummy variables: A="00", B="10", C="01". |
Multiple categorical predictors | Complex multi-category encoding across predictors. | Break each predictor into dummy variables. | Compare to the reference category for each predictor. |
Collinearity | Closely related predictors can blur individual effects. | Check for high inter-correlation among predictors. | Use Variance Inflation Factor (VIF) to detect and address. |
Correlation of error terms | Correlated errors suggest model is missing information. | Tracking of residuals over time in time series data. | Incorporate missing time-dependent factors or use methods designed for time series. |
Non-constant variance of error terms | Non-constant error variance can bias estimates. | Check if error variance changes with predictor variable. | Transform response or use weighted least squares. Consider adding missing predictors. |
Outliers | Can unduly influence regression line and predictions. | Use residual plots and other diagnostics. | Identify and address outliers, possibly by removal or transformation. |
High-leverage points | Unusual predictor values can heavily influence the regression line. | Check for extreme x-values. | Use Cook's distance or other measures to detect and handle high-leverage points. |
Non-linear Relationships | Linear model may not capture curvilinear relationships. | Check residuals and scatter plots for non-linear patterns. | Add interaction terms or polynomial terms to the model. |
Metrics of Fit
Metrics for Assessing Significance of Individual Predictors
The table shown above for simple linear regression also applies to multiple linear regression.
Metrics for Assessing Overall Model Fit:
Metric/Test | Formula (or Definition) | Rationale | Range | Interpretation | Pros | Cons | When to Use |
---|---|---|---|---|---|---|---|
$R^2$ (Coefficient of Determination) | $R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}$ | Quantifies variance captured by the predictors collectively. | 0 to 1 | Proportion of variance in dependent variable explained by the model. Higher is generally better. | Easy to interpret; values between 0 and 1. | Can be overly optimistic; doesn't confirm causal relationship. | General sense of overall model fit. |
Adjusted $R^2$ | $\bar{R}^2 = 1 - (1 - R^2)\dfrac{n-1}{n-p-1}$ | Adjusts $R^2$ based on the number of predictors to penalize overfitting. | Up to 1 (can be negative) | Adjusted measure of explained variance that accounts for model complexity. Higher is generally better. | Adjusts for number of predictors; less optimistic than $R^2$. | Can still be high with unnecessary predictors. | Multiple regression when comparing models with different numbers of predictors. |
AIC (Akaike Information Criterion) | $\mathrm{AIC} = 2k - 2\ln(\hat{L})$ | Balances model fit with complexity. | $-\infty$ to $\infty$ | Lower AIC suggests a better model. | Penalizes model complexity; useful for model selection. | Relative measure, no absolute interpretation. | Model selection, especially in multiple regression. |
BIC (Bayesian Information Criterion) | $\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L})$ | Similar to AIC, but provides a stronger penalty for model complexity. | $-\infty$ to $\infty$ | Lower BIC suggests a better model. | Stronger penalty for complexity than AIC; useful for model selection. | More conservative than AIC. | Model selection, especially when comparing models with substantially different numbers of predictors. |
F-test | $F = \dfrac{(SS_{tot} - SS_{res})/p}{SS_{res}/(n - p - 1)}$ | Compares the model's fit to data against a no-predictor model. | 0 to ∞ | Larger F value suggests a significant model overall. | Useful for comparing models; holistic measure. | In simple regression, tests a similar hypothesis as the t-test for overall model significance. | Comparing nested models or in multiple regression for overall significance. |
Residual Analysis | Residual = Observed value - Predicted value | Validates model assumptions by analyzing overall error patterns. | Varies based on data | Patternless residuals suggest model fits the overall data structure; look for trends or outliers in plots. | Provides holistic insights into fit, outliers, assumptions. | Requires subjective judgment. | Throughout the modeling process to validate overall model fit. |
RMSE (Root Mean Square Error) | $\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | Measures the average magnitude of residuals for the entire model. | 0 to ∞ | Lower RMSE indicates better overall fit; gives error magnitude in units of dependent variable. | Absolute measure of fit; units of dependent variable. | Sensitive to outliers. | To gauge the overall magnitude of model prediction errors. |
Note: In the formulas for AIC and BIC, $k$ represents the number of estimated parameters in the model, $\hat{L}$ is the maximized likelihood of the model, and $n$ is the number of observations. In the adjusted $R^2$ formula, $p$ represents the number of predictors in the model.
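These quantities are reported directly by `statsmodels`; the sketch below compares a model with one useful predictor to a model that also includes a pure-noise predictor (synthetic data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
df = pd.DataFrame({"x1": rng.normal(size=200), "noise": rng.normal(size=200)})
df["y"] = 3.0 + 2.0 * df["x1"] + rng.normal(size=200)

small = smf.ols("y ~ x1", data=df).fit()
big = smf.ols("y ~ x1 + noise", data=df).fit()

# R^2 never decreases when a predictor is added; adjusted R^2, AIC, and BIC include
# a complexity penalty, so the noise predictor usually does not improve them.
for name, m in [("y ~ x1", small), ("y ~ x1 + noise", big)]:
    print(name, round(m.rsquared, 4), round(m.rsquared_adj, 4),
          round(m.aic, 2), round(m.bic, 2))
```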
4. Linear Regression vs. K-Nearest Neighbors
Aspect | Linear Regression | K-Nearest Neighbors |
---|---|---|
Model Type | Parametric | Non-parametric |
Model Complexity | Low | High |
Model Interpretability | High | Low |
Model Performance | High when the linear assumption holds | Lower unless data are plentiful |
Training Speed | Fast | Fast (lazy learner; no explicit training) |
Prediction Speed | Fast | Slow |
Data Requirements | Small | Large |
Pros | Simple to implement and interpret; fast to train and predict; performs well with a small number of observations. | No assumptions about the form of the relationship; performs well with a large number of observations; can be used for classification and regression. |
Cons | Assumes a linear relationship between predictors and response; sensitive to outliers; performs poorly with a large number of predictors. | Slow to predict; requires careful preprocessing of data (e.g., feature scaling); cannot extrapolate; performs poorly with a large number of predictors (curse of dimensionality). |
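A short scikit-learn sketch contrasting the two methods on the same synthetic data; the train/test split and the choice of $k=5$ are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(10)
X = rng.uniform(0, 10, size=(500, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=2.0, size=500)   # linear ground truth

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lin = LinearRegression().fit(X_train, y_train)                   # parametric
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)   # non-parametric

for name, model in [("linear", lin), ("knn", knn)]:
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(name, round(mse, 3))

# KNN cannot extrapolate: far outside the training range it just averages the
# nearest stored neighbours, while the linear model extends its fitted line.
print(lin.predict([[100.0]]), knn.predict([[100.0]]))
```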
5. Q&A
Q1: What does the Maximum Likelihood Estimation (MLE) method aim to achieve in the context of statistical models? A1: MLE aims to find the parameter values that maximize the likelihood of the observed data given the model.
Q2: How does the Adjusted $R^2$ differ from the regular $R^2$, and why might it be useful in multiple linear regression? A2: Adjusted $R^2$ adjusts for the number of predictors in the model and penalizes overfitting. It is especially useful in multiple regression when comparing models with different numbers of predictors, as it accounts for model complexity.
Q3: In the comparison between Linear Regression and K-Nearest Neighbors, which method is considered non-parametric and why? A3: K-Nearest Neighbors is considered non-parametric because it doesn't make any assumptions about the functional form of the relationship between predictors and response and relies directly on the observed data.
Q4: What is the primary concern of having high collinearity among predictors in a regression model, and how can it be detected? A4: High collinearity can make it difficult to identify the individual effects of predictors on the response due to their close relationship. One tool to detect multicollinearity is the Variance Inflation Factor (VIF).
Q5: According to the lecture, what is one of the main advantages of Linear Regression over K-Nearest Neighbors in terms of model interpretability and speed? A5: Linear Regression is simpler to implement and interpret compared to K-Nearest Neighbors. Additionally, it is faster to train and predict.
Q6: How should you encode a single categorical predictor with more than two levels for use in a regression model? A6: For a predictor with three levels (A, B, C), use two dummy variables: A="00", B="10", C="01".
Q7: What issue arises when there is a non-constant variance of error terms in regression models, and how can it be addressed? A7: Non-constant error variance can bias estimates. To address this, one can transform the response, use weighted least squares, or consider adding missing predictors.
Q8: Why are outliers a concern in regression analysis? A8: Outliers can unduly influence the regression line and predictions, potentially leading to misleading results.
Q9: When analyzing the residual plots of a regression model, what might a non-linear pattern suggest? A9: A non-linear pattern in the residuals might suggest that the model isn't capturing curvilinear relationships between predictors and the response.
Q10: What is the primary difference between AIC and BIC when assessing the fit of a regression model? A10: Both AIC and BIC balance model fit with complexity, but BIC provides a stronger penalty for model complexity compared to AIC.
Q11: How does the Bayesian Information Criterion (BIC) penalize model complexity compared to the Akaike Information Criterion (AIC)? A11: BIC provides a stronger penalty for model complexity than AIC, making it more conservative.
Q12: What do the symbols $k$, $\hat{L}$, and $n$ represent in the formulas for AIC and BIC? A12: In the formulas for AIC and BIC, $k$ represents the number of parameters in the model, $\hat{L}$ is the maximized likelihood of the model, and $n$ is the number of observations.
Q13: In the comparison table between Linear Regression and K-Nearest Neighbors, which method requires a smaller dataset to perform well? A13: Linear Regression performs well with a small number of observations, while K-Nearest Neighbors performs well with a larger dataset.
Q14: What's a notable disadvantage of K-Nearest Neighbors when dealing with a large number of predictors? A14: K-Nearest Neighbors performs poorly with a large number of predictors.
Q15: Why is the $R^2$ (Coefficient of Determination) metric considered to be potentially overly optimistic in assessing model fit? A15: $R^2$ can be overly optimistic because it never decreases with the addition of more predictors, regardless of their actual relevance or contribution to the model.
Q16: What does the term "high-leverage points" refer to in the context of regression analysis? A16: High-leverage points are observations with unusual predictor values that can heavily influence the regression line.
Q17: How can collinearity be detected in regression models? A17: Collinearity can be detected by checking for high inter-correlation among predictors. The Variance Inflation Factor (VIF) is a common tool used to detect and address collinearity.
Q18: What is the primary advantage of using the adjusted $R^2$ over the regular $R^2$ in multiple regression? A18: The adjusted $R^2$ accounts for the number of predictors and penalizes overfitting, making it more appropriate for models with multiple predictors.
Q19: Between Linear Regression and K-Nearest Neighbors, which one does not make assumptions about the underlying data? A19: K-Nearest Neighbors does not make assumptions about the underlying data, making it non-parametric.
Q20: Why might RMSE (Root Mean Square Error) be a particularly useful metric for certain applications? A20: RMSE measures the average magnitude of residuals for the entire model, giving an absolute measure of fit in the units of the dependent variable. This makes it valuable for understanding the real-world implications of prediction errors.