Lecture 4: Support Vector Machine (SVM)

Date: 2023-05-11

1. Overview

SVM is a powerful and widely used classification method in machine learning. It operates by finding a hyperplane that best divides a dataset into classes, optimizing for the largest margin between the data points of the two classes.

What is a Margin?

Margin refers to the distance between the decision boundary (or hyperplane) and the nearest data point from either class. Think of it like a buffer or safety zone around the separating line. A larger margin implies that our classifier is more confident in its decisions.

Maximal Margin Classifier

This is a classifier that maximizes the margin between the two classes: it finds the hyperplane for which the margin, the distance to the closest point from either class, is as large as possible. However, such a hyperplane exists only when the data is linearly separable.

Support Vector Classifier

When the data isn't perfectly separable, we might allow some misclassifications to get a better overall model. This is where the Support Vector Classifier comes in. It tries to find the best hyperplane that separates most of the data correctly, allowing for some flexibility in misclassifying data points.

Support Vector Machine

SVM generalizes the concept of the Support Vector Classifier to allow for more complex decision boundaries using something called "kernels". Kernels let us create non-linear decision boundaries by transforming our input data into a higher-dimensional space.

Non-separable Case and Noise

In the real world, data is often noisy and non-separable. In these cases, SVM can still be applied by allowing some errors in classification. The "C" parameter in SVM helps manage this trade-off: a smaller C gives a wider margin but may misclassify more points, while a larger C gives a narrower margin and tries to classify all points correctly but might overfit to the noise.
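
A minimal sketch of this trade-off, assuming scikit-learn and NumPy are available (the synthetic data and the two C values are illustrative choices, not from the lecture):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two slightly overlapping clusters.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # For a linear SVM in canonical form, the margin width is 2 / ||w||.
    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)
    print(f"C={C}: margin width = {margin:.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```

A small C typically prints a wider margin with more support vectors; a large C narrows the margin to reduce training errors.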


2. Support Vector Classifiers

Support Vector Classifiers are a form of SVM used when the data isn't perfectly linearly separable. Instead of insisting on perfectly separating the two classes, SVC allows some data points to be on the wrong side of the margin to get a better overall classification boundary.

Optimization Problem

The support vector classifier is the solution to the following optimization problem:

$$
\min_{\beta,\,\beta_0,\,\xi}\;\; \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{n}\xi_i
\qquad \text{subject to} \qquad
y_i\,(\beta^\top x_i + \beta_0) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n
$$

Where:

  • $\beta$ and $\beta_0$ define the hyperplane.
  • $\xi_i$ are slack variables, which allow individual data points to be on the wrong side of the margin or even the hyperplane.
  • $C$ is a regularization parameter that determines the trade-off between maximizing the margin and classifying all points correctly.

Intuition

First term, $\frac{1}{2}\|\beta\|^2$:

  • This term represents the magnitude (or "size") of the vector $\beta$, which defines the hyperplane's orientation.
  • Minimizing $\|\beta\|$ essentially means we're trying to maximize the margin between the two classes in our data (the margin width is $2/\|\beta\|$). The larger the margin, the more confident we are that new data points will be classified correctly.

Second term, $C\sum_{i=1}^{n}\xi_i$:

  • This term is a penalty. For each data point $x_i$, we have a corresponding "slack" value, $\xi_i$, which measures how much that data point violates the desired margin or is misclassified.
  • $C$ is a constant that determines the "weight" or importance of these violations. A large $C$ makes the penalty for violations very high, so the classifier will try hard to avoid them. A small $C$ means we're more forgiving of violations if they result in a larger margin.

Constraints:

  • $y_i(\beta^\top x_i + \beta_0) \ge 1 - \xi_i$: This constraint ensures that each data point is on the correct side of the margin, up to its slack. If a data point is correctly classified with the desired margin, $\xi_i$ is zero.
  • If $\xi_i > 0$, the point $x_i$ is within the margin or on the wrong side of the hyperplane. The exact value of $\xi_i$ measures the "degree" of this violation.
  • $\xi_i \ge 0$: This simply ensures that the slack variables are non-negative. It wouldn't make sense to have a negative value for how much a point violates the margin.

Intuitive Recap:

In simpler terms, the formula tries to find a balance:

  • We want a big margin (which is achieved by minimizing $\|\beta\|$).
  • But we also want to classify points correctly, and we're willing to accept some mistakes or violations, especially if they allow us to get a broader margin.
  • The slack variables $\xi_i$ measure these mistakes, and $C$ decides how strict or lenient we are about them.

It's like trying to draw a line between two groups of points so that the line is as wide as possible while having as few points as possible on the wrong side or too close to the line.
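
To make the objective concrete, here is a minimal sketch, assuming NumPy (the candidate hyperplane and toy data are made up for illustration): it computes the slacks $\xi_i = \max(0,\, 1 - y_i(\beta^\top x_i + \beta_0))$ and the value of $\frac{1}{2}\|\beta\|^2 + C\sum_i \xi_i$ for a fixed $(\beta, \beta_0)$, which is the quantity a real solver would minimize.

```python
import numpy as np

def primal_objective(beta, beta_0, X, y, C):
    # Slack xi_i = max(0, 1 - y_i * (beta . x_i + beta_0)):
    # zero when the point is beyond the margin, positive when it violates it.
    xi = np.maximum(0.0, 1.0 - y * (X @ beta + beta_0))
    return 0.5 * beta @ beta + C * xi.sum(), xi

# Toy data: labels must be coded as -1 / +1 for this formulation.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [0.0, -1.0]])
y = np.array([1, 1, -1, -1])

obj, xi = primal_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0)
print("slacks:", xi, "objective:", obj)
```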

Solving the Optimization Problem

The optimization problem described is a constrained quadratic optimization problem. One of the most common methods to solve it is:

Sequential Minimal Optimization (SMO):

  • Developed by John Platt in the late 1990s.
  • It breaks the problem down into the smallest possible sub-problems, which can be solved analytically instead of numerically.
  • The basic idea is to fix all variables except two, solve the two-variable sub-problem analytically, and repeat the process until convergence.

The beauty of SMO is in its efficiency. By solving small pieces of the problem at a time, it can quickly converge to the global optimum.

Other methods, like gradient descent, can also be applied, but SMO is specifically tailored for the SVM problem and is typically more efficient.

Popular libraries like libsvm and software tools like MATLAB's fitcsvm function utilize SMO or its variants to train SVM classifiers.
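
For reference, the quadratic program that SMO actually works on is the dual of the problem above; this is the standard textbook form, not reproduced from this lecture's slides:

$$
\max_{\alpha}\;\; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,x_i^\top x_j
\qquad \text{subject to} \qquad
0 \le \alpha_i \le C,\quad \sum_{i=1}^{n}\alpha_i y_i = 0
$$

SMO updates two multipliers at a time because the equality constraint $\sum_i \alpha_i y_i = 0$ prevents a single multiplier from changing on its own.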


3. Support Vector Machine (SVM)

Support Vector Classifiers work remarkably well when data is almost linearly separable, but real-world data often isn't that neat. There might be intricate patterns, curves, or clusters that a straight line (or a flat plane in higher dimensions) can't separate efficiently. This is where the Support Vector Machine (SVM) enters the scene. While SVC focuses on linear boundaries, SVM offers a way to handle more complex, non-linear decision boundaries, making it a versatile tool for various classification problems.

Classification with Non-Linear Decision Boundaries

Many datasets have distributions that can't be separated with just a straight line. Imagine a scenario where data points form a circular pattern: one class forms a ring, and another class is clustered inside this ring. No straight line can separate these two classes. This is where SVM shines.

By using something called the kernel trick, SVM can project data into higher dimensions where a linear separator becomes feasible. This "separator" in the higher dimension, when projected back to the original space, might appear as a circle, curve, or some other non-linear boundary.

Just like with SVC, the choice of the parameter $C$ and the slack variables $\xi_i$ play a crucial role in SVM, especially when deciding the trade-off between a wider margin and classification error.

The "Kernel Trick"

When faced with non-linear data, one approach is to map the original feature space into a higher-dimensional space where the data becomes linearly separable. In theory, we could manually perform this mapping, find the hyperplane in the higher dimension, and then translate this back to our original space as a non-linear boundary. But this can be computationally expensive.

The kernel trick avoids explicitly computing this mapping and the coordinates in the higher-dimensional space. Instead, it directly computes the inner products between the images of all pairs of data in this higher-dimensional space.

A kernel function is a function that computes the dot product between the transformed vectors in the higher-dimensional space without us having to define the transformation explicitly.

Mathematically, let's denote the mapping as $\phi$. A kernel function is then defined as:

$$
K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle
$$

Where:

  • $x_i$ and $x_j$ are two data points in the original feature space.
  • $\phi$ represents the transformation from the original space to the higher-dimensional space.
  • $K$ is our kernel function.

The beauty here is that we only need to compute $K(x_i, x_j)$ and not the individual coordinates of $\phi(x_i)$ and $\phi(x_j)$.
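
A small numerical sanity check of this idea, assuming NumPy (the degree-2 homogeneous polynomial kernel $K(x, z) = (x^\top z)^2$ is used here because its feature map $\phi$ can be written out by hand; the example itself is illustrative, not from the lecture):

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (x . z)^2 in 2D.
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # explicit mapping, then dot product
print(poly_kernel(x, z)) # kernel trick: same number, no mapping needed
```

Both lines print the same value, which is exactly why the explicit transformation never has to be computed.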

Common Kernel Functions

  1. Linear Kernel: $K(x_i, x_j) = x_i^\top x_j$. It's simply the dot product of the two input vectors. This doesn't change the feature space.

  2. Polynomial Kernel: $K(x_i, x_j) = (1 + x_i^\top x_j)^d$, where $d$ is the degree of the polynomial. This creates interaction terms and raises features to a certain power.

  3. Radial Basis Function (RBF) or Gaussian Kernel: $K(x_i, x_j) = \exp(-\gamma\,\|x_i - x_j\|^2)$, where $\gamma$ is a parameter that needs to be specified. This is an incredibly versatile kernel, capable of creating very complex decision boundaries.

  4. Sigmoid Kernel: $K(x_i, x_j) = \tanh(\kappa\, x_i^\top x_j + c)$, where $\kappa$ and $c$ are constants. It mirrors the behavior of the perceptron in neural networks.
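
For convenience, all four kernels above are also available as pairwise helpers in scikit-learn; a short sketch (the parameter values, e.g. degree=2 and gamma=0.5, are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel)

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])

print(linear_kernel(X))                         # x_i . x_j
print(polynomial_kernel(X, degree=2, coef0=1))  # (gamma * x_i . x_j + 1)^2
print(rbf_kernel(X, gamma=0.5))                 # exp(-0.5 * ||x_i - x_j||^2)
print(sigmoid_kernel(X, coef0=1))               # tanh(gamma * x_i . x_j + 1)
```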

Why is it a "Trick"?

The kernel trick is a nifty computational shortcut. Instead of explicitly computing the transformation and working in the potentially vast higher-dimensional space, the kernel trick allows us to work in the original feature space while leveraging the power of higher dimensions. By merely computing the kernel functions, which often have simple formulas like the ones above, SVM can classify non-linear data without the massive computational overhead.

Example

To illustrate the kernel trick, let's dive into a scenario where data isn't linearly separable. In these cases, SVM can still find a decision boundary by mapping the data into a higher-dimensional space using a kernel function. Once in this higher-dimensional space, the data might be linearly separable, allowing SVM to find a hyperplane.

Example with Kernel Trick:

Imagine a simple 2D scenario where you have points from two classes:

  • Class 1 (label: +1) is clustered in a circle of radius 1.
  • Class 2 (label: -1) is clustered in a larger circle with radius 2, surrounding Class 1.

Clearly, in this 2D space, you can't separate the two classes with a straight line. But what if we could transform this space so that the two classes are separable?

Radial Basis Function (RBF) Kernel:

One common kernel used in SVM is the RBF or Gaussian kernel. It's defined as $K(x, x') = \exp(-\gamma\,\|x - x'\|^2)$, where $x$ and $x'$ are data points and $\gamma$ is a parameter.

In our circles example, the RBF kernel can be thought of as adding a new dimension to our 2D space, creating a 3D space. For a given point $x$, its new coordinate in this third dimension is determined by its distance from the center of the circles.

Transformation:

When you apply this transformation, the original circles on our 2D plane get transformed into a 3D space where Class 1 forms a peak and Class 2 forms a doughnut shape around it. In this 3D space, you can imagine a flat plane that slices right between the peak and the doughnut, perfectly separating the two classes!

Classifying with the Kernel Trick:

Once trained, when you get a new data point in your original 2D space, you apply the same transformation (using the kernel function) to figure out where it lands in the 3D space and then determine which side of the hyperplane it's on.
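
A minimal sketch of this exact scenario, assuming scikit-learn (make_circles stands in for the two rings; the gamma and C values are arbitrary illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: one class inside, one class outside.
X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))

# A point near the centre should fall in the inner class,
# a point far outside should fall in the outer class.
print(clf.predict([[0.0, 0.1], [1.2, 0.0]]))
```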

Conclusion:

The kernel trick is a powerful technique that allows SVM to deal with non-linear data without explicitly computing the transformation. By just computing the kernel (dot product in the transformed space) between data points, SVM can work in this implicitly transformed space.

In practice, for SVM, there are various kernel functions available (linear, polynomial, RBF, sigmoid, etc.), and the choice of kernel often depends on the nature of the data and the problem at hand. The RBF kernel is particularly popular due to its flexibility in handling a variety of data structures.


4. Advanced Topics

SVMs with More than Two Classes

Traditionally, SVMs are binary classifiers; they're designed to handle two classes. However, in practice, we often come across problems with more than two classes, and there are strategies to adapt SVMs for these multi-class tasks:

  1. One-vs-One (OvO):

     • For a problem with $K$ classes, train $K(K-1)/2$ classifiers, one for each pair of classes. Each classifier is trained only on data from its two classes.
     • For prediction, run all classifiers on a test instance. The class that gets predicted the most times becomes the final prediction for the instance.

  2. One-vs-All (or One-vs-Rest, OvA or OvR):

     • For a problem with $K$ classes, train $K$ classifiers. For each classifier, one class is treated as the positive class while all other classes are combined and treated as the negative class.
     • For prediction, run all classifiers on a test instance. The classifier that outputs the highest confidence score determines the final class of the instance.

In terms of efficiency, OvA is generally faster as it requires training fewer classifiers ($K$ instead of $K(K-1)/2$). However, in terms of accuracy, the best method can vary depending on the specific problem and data distribution.
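
Both strategies can be made explicit with scikit-learn's wrappers; a small sketch on the three-class iris data (the dataset and kernel are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# K = 3 classes: OvO trains K*(K-1)/2 = 3 classifiers, OvR trains K = 3.
print("OvO estimators:", len(ovo.estimators_))
print("OvR estimators:", len(ovr.estimators_))
```

Note that SVC already applies a one-vs-one scheme internally for multi-class inputs; the wrappers simply make the chosen strategy explicit.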

Relationship to Logistic Regression

Both SVM and Logistic Regression (LR) are linear models used for classification tasks, but they have different objectives and make their decisions based on different principles. Here's a comparison:

  1. Objective Function:

     • SVM: It aims to find a hyperplane that maximizes the margin between the closest points (support vectors) of the two classes. It's primarily concerned with getting the 'widest' possible separation.
     • LR: It estimates probabilities by modeling the log-odds of the binary response variable as a linear combination of the predictors. The objective is to minimize the log loss (equivalently, to maximize the likelihood of the observed data).

  2. Decision Boundary:

     • SVM: Linear (a hyperplane in higher dimensions) unless the kernel trick is used to obtain non-linear boundaries.
     • LR: Always linear in its base form.

  3. Robustness:

     • SVM: Since it focuses only on the points that are hardest to tell apart (the support vectors), SVM can be more robust to outliers than LR.
     • LR: It tries to minimize the overall error across all points, which can make it more sensitive to outliers.

  4. Probabilistic Interpretation:

     • SVM: Doesn't naturally output probabilities for class membership.
     • LR: Directly provides the probability of class membership.

  5. Regularization:

     • Both methods can incorporate regularization (like L1 or L2 norms) to avoid overfitting. Regularization in SVM is controlled by the parameter $C$, whereas in LR it's typically controlled by a parameter $\lambda$ (a larger $\lambda$ means more regularization, playing the opposite role of $C$).

In essence, while both SVM and LR can often produce similar results, they approach the classification problem from different angles. SVM focuses on the geometry of the data and finds the best "street" between classes, whereas LR focuses on maximizing the likelihood of the observed data given the model. Depending on the nature of the data and the problem, one might perform better than the other.
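
As a rough side-by-side sketch, assuming scikit-learn (synthetic data, default hyperparameters), fitting both models on the same data shows that each produces a linear boundary, but only LR natively outputs probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

svm = SVC(kernel="linear").fit(X, y)
lr = LogisticRegression().fit(X, y)

print("SVM weights:", svm.coef_[0], "intercept:", svm.intercept_[0])
print("LR  weights:", lr.coef_[0],  "intercept:", lr.intercept_[0])
print("LR class probabilities (first 3 points):", lr.predict_proba(X[:3]))
# SVC exposes predict_proba only if probability=True is set at construction,
# which fits Platt scaling on top of the margin-based decision function.
```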


5. Q&A

  1. Q: What is the primary objective of an SVM? A: The primary objective of an SVM is to find a hyperplane that best separates data into two classes while maximizing the margin between the closest points (support vectors) of the two classes.

  2. Q: In an SVM, what are support vectors? A: Support vectors are the data points that lie closest to the separating hyperplane and effectively determine the orientation and position of the hyperplane. They are critical because they are the only data points that influence the optimal location of the hyperplane.

  3. Q: What is the difference between a maximal margin classifier and a support vector classifier? A: A maximal margin classifier finds a hyperplane that perfectly separates two classes with the widest margin. However, it assumes that the data is linearly separable. On the other hand, a support vector classifier allows for some misclassifications or violations in the margin, making it suitable for data that isn't perfectly linearly separable.

  4. Q: How does the kernel trick help SVMs? A: The kernel trick allows SVMs to create non-linear decision boundaries by implicitly mapping the input data into higher-dimensional spaces without actually performing the computation in that high-dimensional space. It achieves this by only computing the dot products between the mapped data points.

  5. Q: What role does the regularization parameter $C$ play in SVM? A: The regularization parameter $C$ determines the trade-off between maximizing the margin and classifying training points correctly. A smaller value of $C$ gives more importance to maximizing the margin, even if it misclassifies more points, while a larger value emphasizes correct classification of the training points.

  6. Q: How can SVMs be used for multi-class classification problems? A: SVMs can be adapted for multi-class tasks using strategies like One-vs-One (OvO) or One-vs-All (OvA/OvR). OvO involves training separate classifiers for each pair of classes, while OvA trains a classifier for each class against all other classes combined.

  7. Q: Why might one choose an SVM over Logistic Regression for a classification problem? A: One might choose SVM over Logistic Regression if the data has a clear margin of separation, if the problem involves high-dimensional space, or if the data contains outliers since SVM, especially with certain kernels, can be more robust against outliers.

  8. Q: What is a radial basis function (RBF) in the context of SVMs? A: RBF is a type of kernel used in SVMs to create non-linear decision boundaries. It's defined as $K(x_i, x_j) = \exp(-\gamma\,\|x_i - x_j\|^2)$, where $\gamma$ is a parameter that determines the shape of the decision boundary.

  9. Q: How does an SVM handle unbalanced data? A: For unbalanced datasets, SVMs can be sensitive to the majority class, leading to a sub-optimal decision boundary. However, this can be mitigated by using techniques like class weighting, oversampling the minority class, or undersampling the majority class.

  10. Q: In what scenario might one use a polynomial kernel in SVM? A: A polynomial kernel, given by $K(x_i, x_j) = (1 + x_i^\top x_j)^d$ where $d$ is the degree of the polynomial, can be useful when the data has polynomial relationships or when it's structured in a way that a polynomial decision boundary would provide a better fit than a linear or RBF kernel.

  11. Q: How do slack variables in an SVM formulation help with non-linearly separable data? A: Slack variables allow some data points to violate the margin or even be misclassified, providing flexibility when data isn't perfectly linearly separable. They essentially provide a "cushion" to account for the non-separability of the data.

  12. Q: Why might one opt for a linear kernel over a more complex kernel in SVM? A: Even though more complex kernels allow for non-linear decision boundaries, they can sometimes lead to overfitting. If the data is nearly linear or if simpler models are preferred for interpretability or computational efficiency, a linear kernel might be chosen.

  13. Q: What is the "dual problem" in SVMs, and why is it important? A: The dual problem is an alternative formulation of the SVM optimization problem. It focuses on maximizing the Lagrange multipliers rather than minimizing the primal objective. Solving the dual is computationally advantageous, especially for non-linear SVMs, and it naturally incorporates the kernel trick.

  14. Q: How can an SVM be used for regression rather than classification? A: SVMs can be adapted for regression tasks, called Support Vector Regression (SVR). SVR tries to fit a hyperplane such that most of the data points are within an epsilon-wide margin, while penalizing points outside this margin.

  15. Q: In what scenarios is it beneficial to use a sigmoid kernel in SVM? A: A sigmoid kernel, defined as $K(x_i, x_j) = \tanh(\kappa\, x_i^\top x_j + c)$, can transform the data similarly to a neural network's activation function. It might be useful in cases resembling two-class neural network predictions, but it's less commonly used than RBF or polynomial kernels in practice.

  16. Q: What potential issues can arise from using too large a value for the regularization parameter $C$ in SVMs? A: A large value of $C$ places high emphasis on classifying all training examples correctly. This can lead to overfitting, especially in the presence of noisy or overlapping data, as the SVM will try to fit outliers or anomalies.

  17. Q: How does the choice of kernel in an SVM affect its computational efficiency? A: The choice of kernel can significantly affect computational efficiency. For example, linear SVMs are generally faster to train than non-linear SVMs. Some kernels, like RBF, may require more computations, especially if they transform data into a much higher dimensional space.

  18. Q: Why might feature scaling be important when training an SVM? A: Feature scaling ensures that all features contribute comparably to the distance and inner-product computations in an SVM. Without scaling, features with larger magnitudes can dominate, potentially leading to sub-optimal decision boundaries (see the sketch after this list).

  19. Q: What role does the bias term play in the SVM decision function? A: The bias term $\beta_0$ allows the hyperplane to shift away from the origin. Without $\beta_0$, the hyperplane would always pass through the origin, limiting its flexibility in classifying data points.
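
Tying together the questions on unbalanced data and feature scaling above, here is a minimal sketch, assuming scikit-learn (the imbalance ratio and kernel choice are illustrative): features are standardized before the SVM, and classes are re-weighted inversely to their frequency.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with roughly a 90% / 10% class imbalance.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

model = make_pipeline(
    StandardScaler(),                            # no feature dominates the distances
    SVC(kernel="rbf", class_weight="balanced"),  # heavier penalty on minority-class errors
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```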

  20. Q: How do SVMs generally perform in terms of interpretability compared to other classifiers? A: SVMs, especially with non-linear kernels, are generally less interpretable than simpler models like logistic regression or decision trees. The decision boundary in high-dimensional or transformed spaces might be hard to visualize or explain in the original feature space.