Lecture 3. Classification
Date: 2023-02-14
1. Why not linear regression for classification?
Linear regression is a powerful tool for predicting numerical outcomes based on input variables. However, when it comes to classification tasks, it presents several limitations:
Implied Natural Ordering
Linear regression assumes a continuous outcome, implying a natural ordering. For many classification scenarios, categories don't have such a meaningful order. For instance, predicting animal types like "cat", "dog", and "bird" using numbers would inadvertently suggest a hierarchy or order that doesn't truly exist.
Predicted Values Beyond the Classification Range
Linear regression might produce values below 0 or above 1, which don't make sense when we're trying to classify instances into categories, especially when we want to interpret the outcomes as probabilities.
Sensitivity to Outliers
Both linear regression and some classification methods can be sensitive to outliers. However, the impact is different:
- In linear regression, outliers can drastically change the slope and intercept of the regression line.
- In classification, especially with methods like logistic regression, outliers can affect the decision boundary and lead to misclassification.
Assumption of Homoscedasticity
Linear regression assumes constant variance of the residuals across levels of the independent variables. This assumption can be violated in classification problems, especially when categories are imbalanced.
Loss Function Mismatch
Linear regression minimizes the mean squared error, which might not be the best approach for classification tasks. Classification often requires loss functions like cross-entropy, which are designed to handle discrete outcomes better.
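To make the mismatch concrete, here is a minimal NumPy sketch (with made-up labels and predicted probabilities) showing that squared error barely distinguishes a confidently wrong prediction from a mildly wrong one, while cross-entropy penalizes the confident mistake much more heavily:

```python
import numpy as np

# Toy binary labels and predicted probabilities (values invented for illustration).
y_true = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.6, 0.05])  # the last prediction is confidently wrong

# Mean squared error treats the 0.05 miss only slightly worse than the 0.6 one.
mse = np.mean((y_true - p_hat) ** 2)

# Cross-entropy penalizes confident mistakes heavily.
cross_entropy = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

print(f"MSE:           {mse:.3f}")
print(f"Cross-entropy: {cross_entropy:.3f}")
```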
2. Logistic Regression
The Logistic Model
Logistic regression is applied when the dependent variable is binary. At its heart lies the logistic function, which takes the form of an S-shaped curve:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

Where:
- $p(X)$ represents the probability of the outcome being 1.
- $\beta_0$ is the intercept.
- $\beta_1$ is the coefficient of the predictor $X$.
Through some algebraic manipulation, we can express the odds as:

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$$
Taking the natural logarithm of both sides, we obtain:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$
The left side of this equation is referred to as the log odds or logit. The equation signifies that logistic regression models the log odds linearly in terms of the predictor(s).
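As a quick illustration, the following NumPy sketch evaluates the logistic function with hypothetical coefficients ($\beta_0 = -3$, $\beta_1 = 1.5$, chosen only for the example) and confirms numerically that the logit recovers the linear predictor:

```python
import numpy as np

def logistic(x, beta0, beta1):
    """p(X) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

# Hypothetical coefficients, just to show the S-shaped curve and the logit.
beta0, beta1 = -3.0, 1.5
x = np.linspace(-2, 6, 9)
p = logistic(x, beta0, beta1)

# The log odds (logit) recovers the linear predictor beta0 + beta1*x.
logit = np.log(p / (1 - p))
print(np.allclose(logit, beta0 + beta1 * x))  # True
```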
Estimating the Regression Coefficients
Maximum Likelihood Estimation (MLE) is favored over least squares in estimating the coefficients for logistic regression because it yields estimates that are more reliable in the context of bounded outcomes (i.e., probabilities between 0 and 1).
The likelihood function, in the context of logistic regression, is defined as:

$$\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} \left(1 - p(x_{i'})\right)$$
For the sake of computational simplicity, it's preferable to maximize the log-likelihood, as given by:

$$\log \ell(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i \log p(x_i) + (1 - y_i) \log\left(1 - p(x_i)\right) \right]$$
Unlike linear regression, there isn't a closed-form solution for the maximum likelihood estimates of the coefficients. Consequently, we rely on iterative techniques to maximize the log-likelihood function. One common approach is the Newton-Raphson method:

$$\beta^{\text{new}} = \beta^{\text{old}} - \left( \frac{\partial^2 \log \ell(\beta)}{\partial \beta \, \partial \beta^{T}} \right)^{-1} \frac{\partial \log \ell(\beta)}{\partial \beta}$$
Using matrix notation, this becomes:

$$\beta^{\text{new}} = \beta^{\text{old}} - H^{-1} \nabla \log \ell(\beta^{\text{old}})$$

Here, $H$ is the Hessian matrix and $\nabla \log \ell$ represents the gradient of the log-likelihood.
The iteration can sometimes be problematic because:
1. Convergence Issues: The procedure may not converge to a solution, indicating potential problems with the data or model specification.
2. Computational Expense: Evaluating the Hessian and its inverse can be computationally intensive, especially with a large number of predictors.
3. Numerical Instabilities: With certain data configurations (e.g., perfect separation), the iterative procedure might lead to very large coefficient estimates, causing instability in predictions.
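For intuition, here is a minimal sketch of the Newton-Raphson updates for simple logistic regression on simulated data; the data-generating coefficients, iteration cap, and convergence tolerance are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from a logistic model with assumed "true" coefficients.
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
beta_true = np.array([-0.5, 2.0])
p_true = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)

# Newton-Raphson: beta_new = beta_old - H^{-1} * gradient
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)                      # gradient of the log-likelihood
    W = p * (1 - p)
    H = -(X.T * W) @ X                        # Hessian of the log-likelihood
    step = np.linalg.solve(H, grad)
    beta = beta - step
    if np.max(np.abs(step)) < 1e-8:           # stop once the update is tiny
        break

print("estimated coefficients:", beta)
```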
Interpreting the Regression Coefficients
Linear Regression
- $\beta_0$: intercept.
- $\beta_1$: slope.
Logistic Regression
- $\beta_0$: log odds of the outcome being 1 when $X = 0$.
- $\beta_1$: change in the log odds of the outcome being 1 for a one-unit increase in $X$.
Making Predictions
Just plug the predictor value into the fitted logistic function:

$$\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}}$$

Then, we can use a threshold to classify the outcome as 0 or 1. For instance, if $\hat{p}(X) \geq 0.5$, we classify the outcome as 1; otherwise, we classify it as 0.
Multiple Logistic Regression
Just as with linear regression, we can use multiple predictors in logistic regression:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

Or equivalently:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$
Multinomial Logistic Regression
When the outcome has more than two categories, we use multinomial logistic regression. This approach estimates multiple sets of coefficients, essentially comparing each category to a reference category. Each set of coefficients provides the log odds of one category relative to the reference.
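As one possible way to fit such a model in practice, the sketch below uses scikit-learn's LogisticRegression on the three-class iris data; the dataset and settings are illustrative assumptions. With the default solver and more than two classes, it estimates one coefficient vector per class:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris has three classes, so this is a multinomial problem.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With more than two classes, scikit-learn fits a multinomial logistic model
# (one coefficient vector per class) using the default lbfgs solver.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("coefficients per class:", clf.coef_.shape)   # (3, 4)
print("test accuracy:", clf.score(X_test, y_test))
```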
3. Generative Models for Classification
Generative vs. Discriminative Models
- Generative Models:
  - They learn the joint probability distribution $P(X, Y)$ and then use Bayes' rule to compute $P(Y \mid X)$.
  - Examples include Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Naive Bayes.
- Discriminative Models:
  - They model $P(Y \mid X)$ directly, learning the boundary between the classes.
  - Examples include Logistic Regression and SVM.
Bayes Rule
It's a mathematical formula that allows one to find a probability when other related probabilities are known. In the context of generative classification:

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

where $\pi_k = P(Y = k)$ is the prior probability of class $k$ and $f_k(x) = P(X = x \mid Y = k)$ is the density of $X$ within class $k$.
Linear Discriminant Analysis (LDA)
For $p = 1$: a single predictor.

- Assumption: Both classes follow a normal distribution with means $\mu_1$ and $\mu_2$ and the same variance $\sigma^2$.

- Estimate the class means and the shared variance from the data:

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)^2$$

- Using Bayes' theorem, the posterior probability for class $k$ is:

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

Where $f_k(x)$ is the Gaussian density function:

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu_k)^2}{2\sigma^2}\right)$$

(because we assume that both classes follow a normal distribution with the same variance, we can use the same $\sigma$ for both classes) and $\pi_k$ is the prior probability of class $k$. The priors are typically estimated as:

$$\hat{\pi}_k = \frac{n_k}{n}$$

where $n$ is the total number of observations and $n_k$ is the number of observations in class $k$.

- The decision boundary is set where the posterior probabilities of the two classes are equal, i.e., where:

$$\pi_1 f_1(x) = \pi_2 f_2(x)$$

With equal priors, this reduces to $x = \frac{\mu_1 + \mu_2}{2}$, the midpoint of the two class means.
For $p > 1$: multiple predictors.

- Assumption: Observations from each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector $\mu_k$ and a common covariance matrix $\Sigma$.

- Estimate the class-specific mean vectors:

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i$$

- Estimate the common covariance matrix:

$$\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$

- The decision boundary between classes $k$ and $l$ is where their posterior probabilities (equivalently, their discriminant scores) are equal:

$$\delta_k(x) = \delta_l(x)$$

- Calculate the linear discriminants for classification using:

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$

An observation is assigned to the class with the largest $\delta_k(x)$.
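The following NumPy sketch implements these estimation and classification steps from scratch on simulated two-class data; the simulated means and shared covariance are assumptions chosen only to demonstrate the formulas:

```python
import numpy as np

def lda_fit(X, y):
    """Estimate LDA parameters: class priors, class means, and a pooled covariance."""
    classes = np.unique(y)
    n, p = X.shape
    priors, means = {}, {}
    Sigma = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        priors[k] = len(Xk) / n
        means[k] = Xk.mean(axis=0)
        Sigma += (Xk - means[k]).T @ (Xk - means[k])
    Sigma /= (n - len(classes))                 # pooled covariance estimate
    return classes, priors, means, np.linalg.inv(Sigma)

def lda_predict(X, classes, priors, means, Sigma_inv):
    """Assign each row to the class with the largest linear discriminant."""
    scores = np.column_stack([
        X @ Sigma_inv @ means[k]
        - 0.5 * means[k] @ Sigma_inv @ means[k]
        + np.log(priors[k])
        for k in classes
    ])
    return classes[np.argmax(scores, axis=1)]

# Toy check on simulated data sharing one covariance (assumed for illustration).
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=200)
X1 = rng.multivariate_normal([2, 2], [[1, 0.3], [0.3, 1]], size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)
params = lda_fit(X, y)
print("training accuracy:", np.mean(lda_predict(X, *params) == y))
```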
Quadratic Discriminant Analysis (QDA)
For QDA, the decision boundary is quadratic because each class has its own covariance matrix, unlike LDA where there's a shared covariance matrix.
- Assumptions:
  - Observations within each class are drawn from a multivariate Gaussian distribution.
  - Each class has its own covariance matrix $\Sigma_k$.

- Posterior probability for class $k$:

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

Where $f_k(x)$ is the multivariate Gaussian density function:

$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$$

- Decision Boundary: Found by setting the posterior probabilities of two classes to be equal. Because the covariance matrices differ, the quadratic term $x^T \Sigma_k^{-1} x$ does not cancel between classes, so the boundary is quadratic in $x$.
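Below is a small NumPy sketch of this rule. It scores a point with the quadratic discriminant $\delta_k(x) = -\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k$ (the log of $\pi_k f_k(x)$ up to a constant shared by all classes) and picks the class with the larger score; the means, covariances, and priors are hypothetical values chosen only for illustration:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """Quadratic discriminant delta_k(x) for a single observation x."""
    diff = x - mu_k
    return (-0.5 * diff @ np.linalg.inv(Sigma_k) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma_k))
            + np.log(pi_k))

# Hypothetical class parameters, just to demonstrate the classification rule.
mu = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 1.0])}
Sigma = {0: np.array([[1.0, 0.0], [0.0, 1.0]]),
         1: np.array([[2.0, 0.5], [0.5, 1.0]])}   # class-specific covariances
pi = {0: 0.6, 1: 0.4}

x_new = np.array([1.0, 0.5])
scores = {k: qda_discriminant(x_new, mu[k], Sigma[k], pi[k]) for k in (0, 1)}
print("predicted class:", max(scores, key=scores.get))
```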
Naive Bayes
It is called "naive" because it assumes that the features are conditionally independent given the class label. This is a strong assumption that is often unrealistic. For example, in a spam detection problem, the presence of the word "money" might indicate that the word "cash" is also present, making the two features dependent.
- Bayes' Rule:

$$P(Y = k \mid X = x) = \frac{\pi_k P(X = x \mid Y = k)}{\sum_{l=1}^{K} \pi_l P(X = x \mid Y = l)}$$

Given the independence assumption, the class-conditional likelihood factorizes:

$$P(X = x \mid Y = k) = \prod_{j=1}^{p} P(X_j = x_j \mid Y = k)$$

Where $p$ is the number of predictors.

- Gaussian Naive Bayes (for continuous data):

Assume that the continuous values associated with each class are distributed according to a Gaussian distribution. For a class $k$ and a feature $X_j$:

$$P(X_j = x_j \mid Y = k) = \frac{1}{\sqrt{2\pi \sigma_{jk}^2}} \exp\!\left(-\frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\right)$$

Where $\mu_{jk}$ and $\sigma_{jk}^2$ are the mean and variance of feature $j$ for class $k$.
- Decision: An observation is classified into the class for which the posterior probability is highest.
It's worth noting that while I've shown the Gaussian version of Naive Bayes, there are other variations, like Multinomial Naive Bayes (useful for word counts) and Bernoulli Naive Bayes (for binary features).
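As a brief illustration of the Gaussian variant in practice, the sketch below fits scikit-learn's GaussianNB to simulated continuous features; the simulated class means are arbitrary choices made for the example:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy continuous data: two classes with different (assumed) feature means.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(100, 3)), rng.normal(2, 1, size=(100, 3))])
y = np.array([0] * 100 + [1] * 100)

# GaussianNB estimates a per-class mean and variance for each feature and
# combines them under the conditional-independence assumption.
model = GaussianNB().fit(X, y)
print("per-class feature means:\n", model.theta_)
print("predicted class for a new point:", model.predict([[1.0, 1.0, 1.0]]))
```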
Summary
| Model | Assumptions | Pros | Cons | Notes |
|---|---|---|---|---|
| LDA (Linear Discriminant Analysis) | Normally distributed data for each class; equal covariance matrices for all classes. | Requires fewer parameters to estimate; works well when assumptions hold. | Can perform poorly if assumptions are violated. | Often used when the number of features is large. |
| QDA (Quadratic Discriminant Analysis) | Normally distributed data for each class; allows different covariance matrices. | More flexible than LDA; can model more complex decision boundaries. | Requires more parameters to estimate; can overfit on small datasets. | Useful when classes have distinct covariances. |
| Naive Bayes | Features are independent given the class label. | Simple and fast; effective for high-dimensional datasets; requires less training data. | Independence assumption is often unrealistic; can be outperformed by more sophisticated models. | Variants exist based on feature type (Gaussian, Multinomial, etc.). |
4. Generalized Linear Models (GLMs)
Generalized Linear Models (GLMs) are a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. They were formulated by John Nelder and Robert Wedderburn as a way of unifying various other statistical models, including linear regression, logistic regression, and Poisson regression.
Components of GLMs
- Random Component: Specifies the probability distribution for the response variable (e.g., normal, binomial, Poisson, etc.). The chosen distribution belongs to the exponential family of distributions.

- Systematic Component: Represents the linear combination of the predictors. Mathematically, it can be denoted as:

$$\eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

Where $\eta$ is called the linear predictor, and the $\beta_j$ values are the parameters to be estimated.

- Link Function: It describes the relationship between the systematic component and the expected value of the random component. The link function transforms the expected value of the response variable to produce the linear combination of the predictors. Commonly used link functions include:
  - Identity, $g(\mu) = \mu$: used in linear regression.
  - Log, $g(\mu) = \log(\mu)$: used in Poisson regression.
  - Logit, $g(\mu) = \log\!\left(\frac{\mu}{1 - \mu}\right)$: used in logistic regression.
Mathematical Formulation
Given $Y$ as the response variable, let $\mu = E(Y)$ be its expected value. The relationship is given by:

$$g(\mu) = \eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

Where $g(\cdot)$ is the link function.
GLM Algorithm Steps:
1. Specify the probability distribution of the response variable (e.g., binomial for binary data, Poisson for count data).
2. Choose an appropriate link function to relate predictors and response.
3. Fit the model using maximum likelihood estimation to estimate the parameters.
4. Test the goodness of fit, to see how well the model explains the variance in the response variable.
5. Make predictions based on the fitted model.
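To make these steps concrete, here is a small sketch of a Poisson GLM fit with statsmodels on simulated count data; the data-generating coefficients and prediction points are assumptions made purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Steps 1-2: count response with a log link, so simulate Poisson counts
# whose log-mean is linear in x (the "true" coefficients are assumed).
n = 300
x = rng.uniform(0, 2, size=n)
mu = np.exp(0.3 + 0.8 * x)
y = rng.poisson(mu)

# Step 3: fit by maximum likelihood with the Poisson family (log link by default).
X = sm.add_constant(x)
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Steps 4-5: inspect the fit and predict expected counts for new observations.
print(result.params)     # estimated intercept and slope
print(result.deviance)   # goodness-of-fit measure
print(result.predict(sm.add_constant(np.array([0.5, 1.5]))))
```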
Advantages of GLMs:
- They can handle a variety of response variable types (continuous, binary, counts, etc.).
- They provide a unified framework for modeling many different statistical distributions.
- They can incorporate non-constant variance.
Limitations:
- Assumption that the response variable follows an exponential family distribution.
- Like all models, GLMs can suffer from overfitting especially with many predictors.
Examples:
- Logistic Regression: When the outcome variable is binary. It uses the binomial distribution and the logit link function.
- Poisson Regression: When the outcome variable represents counts. It uses the Poisson distribution and the log link function.
- Linear Regression: Can be thought of as a GLM that assumes a normal distribution and uses the identity link function.
5. Q&A
- Question: Why is linear regression not suitable for classification tasks?
Answer: Linear regression is designed to predict continuous values, not categorical classes. When used for classification, it might produce predictions outside the [0,1] range, making it hard to interpret. Additionally, classification often involves non-linear boundaries which linear regression can't capture without significant modifications.
- Question: What is the primary difference between logistic regression and linear regression?
Answer: Linear regression is used for predicting continuous outcomes, while logistic regression is used for predicting the probability of categorical outcomes (typically binary like 0 or 1). Logistic regression uses the logit link function to squeeze predictions between 0 and 1.
- Question: Can you describe the difference between generative and discriminative models?
Answer: Generative models learn the joint probability distribution of the input and output, and they try to model how the data is generated. Examples include LDA and Naive Bayes. Discriminative models, on the other hand, learn the boundary between classes and model the conditional probability of the output given the input. Logistic regression is an example of a discriminative model.
- Question: What is Bayes Rule and why is it significant in statistical modeling?
Answer: Bayes Rule provides a way to find the probability of an event occurring given prior knowledge. It's written as: $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$. It's fundamental in statistical modeling because it allows for the incorporation of prior knowledge or beliefs when calculating probabilities.
- Question: How does Linear Discriminant Analysis (LDA) classify data points?
Answer: LDA tries to find a linear combination of features that best separates two or more classes in a dataset. It does this by maximizing the distance between means of these classes and minimizing the spread (variance) within each class.
- Question: Why does Quadratic Discriminant Analysis (QDA) result in a quadratic decision boundary?
Answer: QDA, unlike LDA, doesn't assume that the covariance matrices are the same for all classes. Because of this, when calculating the decision boundary, the terms involving the inverse of the covariance matrix lead to quadratic terms in the discriminant function, resulting in a quadratic decision boundary.
- Question: Why is Naive Bayes called 'naive'?
Answer: It's called 'Naive' because it assumes that the features (or predictors) are conditionally independent given the class label. This is a strong and often unrealistic assumption, but despite this, Naive Bayes often performs surprisingly well in practice.
- Question: What are the key components of Generalized Linear Models (GLMs)?
Answer: GLMs consist of three components: a random component that specifies the probability distribution of the response variable, a systematic component that is a linear function of the predictors, and a link function that relates the mean of the response variable to the systematic component.
- Question: How does logistic regression relate to GLMs?
Answer: Logistic regression is a type of GLM where the response variable follows a binomial distribution and the link function used is the logit link. It models the log-odds of the probability of a particular event occurring.
- Question: Why would one use QDA over LDA?
Answer: One might use QDA over LDA if they believe that each class has its own covariance structure, or if the decision boundary between classes is non-linear. QDA can be more flexible in this regard compared to LDA.
- Question: In the context of logistic regression, what does the term "odds ratio" refer to, and how is it interpreted?
Answer: The odds ratio in logistic regression represents the odds of an event occurring in relation to a one-unit increase in a predictor variable, while holding other variables constant. An odds ratio of 1 means no effect, greater than 1 indicates increased odds, and less than 1 indicates decreased odds.
- Question: When would you use LDA over logistic regression?
Answer: LDA might be preferred over logistic regression when the assumptions of equal covariance matrices across classes hold true. It can also be more stable when classes are well-separated or when there's limited sample size.
- Question: How does the Naive Bayes classifier handle continuous data?
Answer: For continuous features, Naive Bayes typically assumes that the values within each class are sampled from a Gaussian distribution. It estimates each feature's per-class mean and variance and plugs them into the Gaussian density to make predictions.
- Question: What is multicollinearity, and why is it a potential issue in logistic regression?
Answer: Multicollinearity arises when predictor variables in a model are highly correlated. In logistic regression, this can lead to unstable coefficient estimates, making it difficult to determine the individual effect of predictors.
- Question: Describe the link between Bayes Rule and Naive Bayes classifier.
Answer: The Naive Bayes classifier is based on applying Bayes Rule, with the "naive" assumption that all features are conditionally independent given the class label. This simplifies the computation of the likelihood term in the Bayes Rule.
- Question: What is the primary distinction between LDA and QDA?
Answer: The primary distinction is in the assumption about the covariance matrices. LDA assumes that all classes share the same covariance matrix, leading to a linear decision boundary. QDA allows each class to have its own covariance matrix, leading to a quadratic decision boundary.
- Question: Explain the concept of link function in GLMs.
Answer: The link function in GLMs describes the relationship between the systematic component (linear combination of predictors) and the expected value of the response variable. It allows for non-linear relationships between the predictors and the response.
- Question: Generative vs. Discriminative models: which typically requires more data for training and why?
Answer: Discriminative models, like logistic regression, typically require more data than generative models, like Naive Bayes or LDA. This is because discriminative models focus on modeling the decision boundary between classes directly, often requiring more examples to accurately capture these boundaries, whereas generative models focus on modeling how each class data is generated.
- Question: What assumptions are made when using LDA for classification?
Answer: LDA assumes that the predictor variables are normally distributed, that each class has the same covariance matrix, and that the samples are IID (independent and identically distributed).
- Question: In the context of GLMs, what is meant by "overdispersion", and how can it be addressed?
Answer: Overdispersion occurs when the observed variance in the data is higher than what the model predicts, often seen in models like Poisson regression. It can be addressed by using a different distribution (like negative binomial) or by introducing additional variance components.