Homework 4. Support Vector Machine: Banknote Authentication
Introduction
This assignment is again based on UCI's Banknote Authentication Dataset. The dataset contains 1,372 samples of genuine and forged banknotes. Each sample contains four continuous features, and the objective is to classify samples into one of two classes.
Helper Functions
Cost Function
Given the SVM's hinge loss, the loss function with respect to the weight vector $W$ for a dataset of $N$ samples is:

$$J(W) = \frac{1}{2}\lVert W \rVert^2 + \frac{C}{N} \sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i \,(W \cdot x_i)\bigr)$$

In this formula:
- $W$ represents the weight vector.
- $x_i$ is the feature vector of the i-th sample from $X$.
- $y_i$ is the label of the i-th sample from $Y$ (typically +1 or -1 for binary classification).
- $C$ is the regularization strength (reg_strength in the code); it controls how heavily margin violations are penalized relative to the size of the weights.
- The term $\frac{1}{2}\lVert W \rVert^2$ represents L2 regularization, which penalizes large weights to avoid overfitting.
- The term $\max(0,\; 1 - y_i\,(W \cdot x_i))$ represents the hinge loss for each sample. It measures how much a sample violates the SVM's margin, and it is zero for samples that are correctly classified outside the margin.
So, this formula calculates the total cost by adding the L2 regularization term and the averaged, C-weighted hinge loss over the entire dataset. The hinge loss increases as samples are misclassified or fall within the SVM's margin, and the L2 regularization term increases as the weights grow in magnitude.
import numpy as np


def compute_cost(W: np.ndarray, X: np.ndarray, Y: np.ndarray, reg_strength: float = 1000.0) -> float:
    """
    Compute the cost for SVM using the hinge loss and L2 regularization.

    Parameters:
    - W (np.ndarray): Weight vector.
    - X (np.ndarray): Data matrix where each row is a sample.
    - Y (np.ndarray): Labels corresponding to each sample.
    - reg_strength (float): Regularization strength parameter C.

    Returns:
    - float: Computed cost.
    """
    n = X.shape[0]
    l2_reg = 0.5 * np.linalg.norm(W) ** 2
    # Average hinge loss over the dataset, weighted by C, to stay consistent
    # with the gradient computed in calculate_cost_gradient below.
    hinge_loss = reg_strength * np.sum(np.maximum(0, 1 - Y * (X @ W))) / n
    return l2_reg + hinge_loss
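As a quick, hand-checkable sanity check (not part of the assignment itself), consider a made-up batch of two samples with $W = (1, 0)$ and both labels $+1$: the sample $(2, 0)$ lies outside the margin and contributes no hinge loss, while $(0.5, 0)$ lies inside the margin and contributes $0.5$. With the default $C = 1000$, the cost should be $0.5 \cdot 1^2 + 1000 \cdot (0 + 0.5)/2 = 250.5$.

# Hand-checkable toy values (made up for illustration); expected cost is 250.5.
W_toy = np.array([1.0, 0.0])
X_toy = np.array([[2.0, 0.0],    # y * (x . W) = 2   -> outside the margin, hinge loss 0
                  [0.5, 0.0]])   # y * (x . W) = 0.5 -> inside the margin, hinge loss 0.5
Y_toy = np.array([1.0, 1.0])
print(compute_cost(W_toy, X_toy, Y_toy))  # 0.5 + 1000 * (0 + 0.5) / 2 = 250.5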
Given the SVM's hinge loss, the gradient of the cost with respect to the weight vector $W$ for a single sample $(x_i, y_i)$ is:

$$\nabla_W J_i(W) = \begin{cases} W & \text{if } y_i\,(W \cdot x_i) \geq 1, \\ W - C\, y_i\, x_i & \text{otherwise.} \end{cases}$$

For a batch of samples, the per-sample gradients are averaged.
def calculate_cost_gradient(W: np.ndarray, X_batch: np.ndarray, Y_batch: np.ndarray, reg_strength: float = 1000.0) -> np.ndarray:
    """
    Compute the gradient of the cost for SVM using the hinge loss and L2 regularization.

    Parameters:
    - W (np.ndarray): Weight vector.
    - X_batch (np.ndarray): Data matrix subset (batch) where each row is a sample.
    - Y_batch (np.ndarray): Labels corresponding to each sample in the batch.
    - reg_strength (float): Regularization strength parameter C.

    Returns:
    - np.ndarray: Gradient of the cost with respect to the weight vector W.
    """
    # Allow a single sample (1-D x, scalar y) to be passed in, as done in sgd below.
    X_batch = np.atleast_2d(X_batch)
    Y_batch = np.atleast_1d(Y_batch)

    batch_size = X_batch.shape[0]
    dw = np.zeros_like(W)
    for i in range(batch_size):
        margin = 1 - (Y_batch[i] * np.dot(X_batch[i], W))
        if margin <= 0:
            # Sample satisfies the margin: only the regularization term contributes.
            dw += W
        else:
            # Sample violates the margin: include the hinge-loss term as well.
            dw += W - (reg_strength * Y_batch[i] * X_batch[i])
    dw /= batch_size
    return dw
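A quick consistency check (not part of the assignment itself) is to compare the analytic gradient against a central finite-difference estimate of compute_cost for a single sample. The values below are made up and chosen so that the sample violates the margin, away from the hinge's kink.

# Made-up single sample that violates the margin: y * (x . W) = -0.1 < 1.
W_chk = np.array([0.5, -0.25, 0.1, 0.0])
x_chk = np.array([1.0, 2.0, -1.0, 0.5])
y_chk = 1.0
C = 10.0

analytic = calculate_cost_gradient(W_chk, x_chk, y_chk, reg_strength=C)

# Central finite differences of compute_cost, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(W_chk)
for j in range(W_chk.size):
    step = np.zeros_like(W_chk)
    step[j] = eps
    cost_plus = compute_cost(W_chk + step, x_chk.reshape(1, -1), np.array([y_chk]), reg_strength=C)
    cost_minus = compute_cost(W_chk - step, x_chk.reshape(1, -1), np.array([y_chk]), reg_strength=C)
    numeric[j] = (cost_plus - cost_minus) / (2 * eps)

print(analytic)                                   # W - C * y * x = [-9.5, -20.25, 10.1, -5.0]
print(np.allclose(analytic, numeric, atol=1e-4))  # True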
Stochastic Gradient Descent
The Stochastic Gradient Descent (SGD) update for the SVM optimization problem can be described as:

$$W_{t+1} = W_t - \eta \,\nabla_W J_i(W_t)$$

Where:
- $W_t$ is the weight vector at iteration $t$.
- $\eta$ is the learning rate.
- $\nabla_W J_i(W_t)$ is the gradient of the cost function with respect to the weight vector, evaluated for a particular sample $(x_i, y_i)$.
from sklearn.utils import shuffle


def sgd(features: np.ndarray, outputs: np.ndarray, learning_rate: float = 0.01, max_epochs: int = 5000) -> np.ndarray:
    """
    Performs Stochastic Gradient Descent (SGD) optimization for SVM.

    Parameters:
    - features (np.ndarray): Data matrix where each row is a sample.
    - outputs (np.ndarray): Labels corresponding to each sample.
    - learning_rate (float): Learning rate for the optimization.
    - max_epochs (int): Maximum number of epochs for the SGD optimization.

    Returns:
    - np.ndarray: Optimized weight vector.
    """
    weights = np.zeros(features.shape[1])
    nth = 0
    prev_cost = float('inf')
    cost_threshold = 0.01  # stop when the relative change in cost drops below 1%
    for epoch in range(1, max_epochs):
        # Shuffle to prevent repeating update cycles
        X, Y = shuffle(features, outputs)
        for ind, x in enumerate(X):
            ascent = calculate_cost_gradient(weights, x, Y[ind])
            weights -= learning_rate * ascent
        # Convergence check on the 2^nth epoch (and on the final epoch)
        if epoch == 2 ** nth or epoch == max_epochs - 1:
            cost = compute_cost(weights, features, outputs)
            print(f"Epoch is: {epoch} and Cost is: {cost}")
            # Stoppage criterion: relative change in cost is small enough
            if abs(prev_cost - cost) < cost_threshold * prev_cost:
                return weights
            prev_cost = cost
            nth += 1
    return weights
The update rule is applied once for each sample in the dataset, and the process is repeated for multiple epochs or until convergence. Convergence is checked by observing the relative change in cost between checkpoints (here, at epochs 1, 2, 4, 8, ...); if the change is very small (below a set threshold), the algorithm can be stopped early.
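Before running this on the real data, a tiny smoke test on made-up, linearly separable points can confirm that the pieces fit together. The values below are arbitrary, the labels are encoded as +1/-1 as the hinge loss assumes, and a constant column stands in for the intercept.

# Made-up, linearly separable 2-D points with an added intercept column.
toy_X = np.array([
    [2.0, 2.0, 1.0],
    [3.0, 2.5, 1.0],
    [2.5, 3.0, 1.0],
    [-2.0, -2.0, 1.0],
    [-3.0, -2.5, 1.0],
    [-2.5, -3.0, 1.0],
])
toy_y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

toy_W = sgd(toy_X, toy_y, learning_rate=0.01, max_epochs=100)
print(f"Toy weights: {toy_W}")
print(f"Toy training accuracy: {np.mean(np.sign(toy_X @ toy_W) == toy_y)}")

The learned weights should separate the two toy clusters; their exact values depend on the shuffling.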
Data Preprocessing
Load Data
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset; the last column holds the class label
data = pd.read_csv('data_banknote_authentication.csv')
Y = data.iloc[:, -1]
X = data.iloc[:, 1:4]

# Add a constant column so the bias is learned as part of the weight vector
X.insert(loc=len(X.columns), column='intercept', value=1)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
Training and Testing the Model
Using Our Implementation
from sklearn.metrics import accuracy_score

# Train the model
print("Training started...")
W = sgd(X_train.to_numpy(), y_train.to_numpy())
print("Training finished.")
print(f"Weights are: {W}")

# Test the model on the test set
y_test_predicted = np.sign(X_test.to_numpy() @ W)
print(f"Accuracy on test dataset: {accuracy_score(y_test.to_numpy(), y_test_predicted)}")
The accuracy on the test dataset is 0.31.
Using Sklearn
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Train the model
print("Training started...")
model = SVC()
model.fit(X_train, y_train)
print("Training finished.")
# Test the model on the test set
accuracy = model.score(X_test, y_test)
print(f"Accuracy on the test dataset: {accuracy}")
The accuracy on the test dataset is 0.90.
Discussion
Scikit-learn's SVC implementation performs significantly better than the manually implemented SGD. There are several possible reasons for this:
- Algorithm Implementation: Scikit-learn is a well-maintained library that has been refined over the years. Its algorithms are optimized for performance and accuracy and have undergone thorough testing, so its implementation likely includes optimizations and refinements not present in the manual implementation.
- Kernel Trick: The manually implemented SGD trains a linear SVM, which assumes a linear decision boundary and may not fit all datasets well. By default, Scikit-learn's SVC uses the radial basis function (RBF) kernel, which can capture non-linear patterns in the data. This can be a significant factor if the data is not linearly separable.
- Hyperparameters: The manually implemented SGD uses fixed hyperparameters, such as the learning rate, which may not be optimal for the data. SVC, on the other hand, exposes multiple hyperparameters with sensible defaults, and they can be tuned further with techniques like grid search (see the sketch after this list).
- Regularization: Scikit-learn's SVC includes regularization to prevent overfitting, which may not be adequately addressed in the manual implementation.
- Convergence Criteria: The manual SGD checks convergence based on a fixed number of epochs and a cost threshold. If the algorithm has not converged by the final epoch, it may not return an optimal solution. Scikit-learn's implementations typically use more sophisticated convergence checks.
- Data Preprocessing: SVMs are, in general, sensitive to feature scaling. If the data was not scaled before being fed to the manual SGD but was scaled for the Scikit-learn SVM (or vice versa), this could significantly impact performance.
- Numerical Stability: There may be aspects of numerical stability and optimization that Scikit-learn handles better, leading to more robust learning.
- Implementation Errors: There might be subtle bugs or inaccuracies in the manually implemented SGD that affect its performance.
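To illustrate the scaling and hyperparameter points above, the following is a minimal sketch (not part of the original assignment): it standardizes the features and grid-searches SVC's kernel, C, and gamma with cross-validation on the training split. The grid values are arbitrary choices for illustration.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Standardize the features, then search a small, arbitrary hyperparameter grid
# with 5-fold cross-validation on the training split. (The constant intercept
# column becomes zeros after scaling, which is harmless since SVC fits its own
# intercept.)
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': ['scale', 0.01, 0.1, 1],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Accuracy on the test dataset: {search.score(X_test, y_test)}")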