Homework 4. Support Vector Machine: Banknote Authentication
Introduction
This assignment is again based on UCI's Banknote Authentication Dataset. The dataset contains 1,372 samples of genuine and forged banknotes. Each sample contains four continuous features, and the objective is to classify samples into one of two classes.
Helper Functions
Cost Function
Given the SVM's hinge loss, the loss function with respect to the weight vector $W$ for a dataset of $N$ samples is:

$$J(W) = \frac{1}{2}\lVert W \rVert^2 + \frac{C}{N} \sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i \,(W \cdot x_i)\bigr)$$

In this formula:
- $W$ represents the weight vector.
- $x_i$ is the feature vector of the i-th sample from $X$.
- $y_i$ is the label of the i-th sample from $Y$ (typically +1 or -1 for binary classification).
- $C$ is the regularization strength (reg_strength in the code); it controls how heavily margin violations are penalized relative to the size of the weights.
- The term $\frac{1}{2}\lVert W \rVert^2$ represents L2 regularization, which penalizes large weights to avoid overfitting.
- The term $\max(0,\; 1 - y_i\,(W \cdot x_i))$ represents the hinge loss for each sample. It measures how much a sample violates the SVM's margin, and it is zero for samples that are correctly classified outside the margin.
So, this formula calculates the total cost by adding the L2 regularization term and the averaged, C-weighted hinge loss over the entire dataset. The hinge loss increases as samples are misclassified or fall within the SVM's margin, and the L2 regularization term increases as the weights grow in magnitude.
import numpy as np


def compute_cost(W: np.ndarray, X: np.ndarray, Y: np.ndarray, reg_strength: float = 1000.0) -> float:
    """
    Compute the cost for SVM using the hinge loss and L2 regularization.

    Parameters:
    - W (np.ndarray): Weight vector.
    - X (np.ndarray): Data matrix where each row is a sample.
    - Y (np.ndarray): Labels corresponding to each sample.
    - reg_strength (float): Regularization strength parameter C.

    Returns:
    - float: Computed cost.
    """
    n = X.shape[0]
    l2_reg = 0.5 * np.linalg.norm(W) ** 2
    # Average hinge loss over the dataset, weighted by C, to stay consistent
    # with the gradient computed in calculate_cost_gradient below.
    hinge_loss = reg_strength * np.sum(np.maximum(0, 1 - Y * (X @ W))) / n
    return l2_reg + hinge_loss
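As a quick, hand-checkable sanity check (not part of the assignment itself), consider a made-up batch of two samples with $W = (1, 0)$ and both labels $+1$: the sample $(2, 0)$ lies outside the margin and contributes no hinge loss, while $(0.5, 0)$ lies inside the margin and contributes $0.5$. With the default $C = 1000$, the cost should be $0.5 \cdot 1^2 + 1000 \cdot (0 + 0.5)/2 = 250.5$.

# Hand-checkable toy values (made up for illustration); expected cost is 250.5.
W_toy = np.array([1.0, 0.0])
X_toy = np.array([[2.0, 0.0],    # y * (x . W) = 2   -> outside the margin, hinge loss 0
                  [0.5, 0.0]])   # y * (x . W) = 0.5 -> inside the margin, hinge loss 0.5
Y_toy = np.array([1.0, 1.0])
print(compute_cost(W_toy, X_toy, Y_toy))  # 0.5 + 1000 * (0 + 0.5) / 2 = 250.5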
Given the SVM's hinge loss, the gradient of the cost with respect to the weight vector $W$ for a single sample $(x_i, y_i)$ is:

$$\nabla_W J_i(W) = \begin{cases} W & \text{if } y_i\,(W \cdot x_i) \geq 1, \\ W - C\, y_i\, x_i & \text{otherwise.} \end{cases}$$

For a batch of samples, the per-sample gradients are averaged.
def calculate_cost_gradient(W: np.ndarray, X_batch: np.ndarray, Y_batch: np.ndarray, reg_strength: float = 1000.0) -> np.ndarray:
    """
    Compute the gradient of the cost for SVM using the hinge loss and L2 regularization.

    Parameters:
    - W (np.ndarray): Weight vector.
    - X_batch (np.ndarray): Data matrix subset (batch) where each row is a sample.
    - Y_batch (np.ndarray): Labels corresponding to each sample in the batch.
    - reg_strength (float): Regularization strength parameter C.

    Returns:
    - np.ndarray: Gradient of the cost with respect to the weight vector W.
    """
    # Allow a single sample (1-D x, scalar y) to be passed in, as done in sgd below.
    X_batch = np.atleast_2d(X_batch)
    Y_batch = np.atleast_1d(Y_batch)

    batch_size = X_batch.shape[0]
    dw = np.zeros_like(W)
    for i in range(batch_size):
        margin = 1 - (Y_batch[i] * np.dot(X_batch[i], W))
        if margin <= 0:
            # Sample satisfies the margin: only the regularization term contributes.
            dw += W
        else:
            # Sample violates the margin: include the hinge-loss term as well.
            dw += W - (reg_strength * Y_batch[i] * X_batch[i])
    dw /= batch_size
    return dw
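A quick consistency check (not part of the assignment itself) is to compare the analytic gradient against a central finite-difference estimate of compute_cost for a single sample. The values below are made up and chosen so that the sample violates the margin, away from the hinge's kink.

# Made-up single sample that violates the margin: y * (x . W) = -0.1 < 1.
W_chk = np.array([0.5, -0.25, 0.1, 0.0])
x_chk = np.array([1.0, 2.0, -1.0, 0.5])
y_chk = 1.0
C = 10.0

analytic = calculate_cost_gradient(W_chk, x_chk, y_chk, reg_strength=C)

# Central finite differences of compute_cost, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(W_chk)
for j in range(W_chk.size):
    step = np.zeros_like(W_chk)
    step[j] = eps
    cost_plus = compute_cost(W_chk + step, x_chk.reshape(1, -1), np.array([y_chk]), reg_strength=C)
    cost_minus = compute_cost(W_chk - step, x_chk.reshape(1, -1), np.array([y_chk]), reg_strength=C)
    numeric[j] = (cost_plus - cost_minus) / (2 * eps)

print(analytic)                                   # W - C * y * x = [-9.5, -20.25, 10.1, -5.0]
print(np.allclose(analytic, numeric, atol=1e-4))  # True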
Stochastic Gradient Descent
The Stochastic Gradient Descent (SGD) update for the SVM optimization problem can be described as:

$$W_{t+1} = W_t - \eta \,\nabla_W J_i(W_t)$$

Where:
- $W_t$ is the weight vector at iteration $t$.
- $\eta$ is the learning rate.
- $\nabla_W J_i(W_t)$ is the gradient of the cost function with respect to the weight vector, evaluated for a particular sample $(x_i, y_i)$.
from sklearn.utils import shuffle


def sgd(features: np.ndarray, outputs: np.ndarray, learning_rate: float = 0.01, max_epochs: int = 5000) -> np.ndarray:
    """
    Performs Stochastic Gradient Descent (SGD) optimization for SVM.

    Parameters:
    - features (np.ndarray): Data matrix where each row is a sample.
    - outputs (np.ndarray): Labels corresponding to each sample.
    - learning_rate (float): Learning rate for the optimization.
    - max_epochs (int): Maximum number of epochs for the SGD optimization.

    Returns:
    - np.ndarray: Optimized weight vector.
    """
    weights = np.zeros(features.shape[1])
    nth = 0
    prev_cost = float('inf')
    cost_threshold = 0.01  # stop when the relative change in cost drops below 1%
    for epoch in range(1, max_epochs):
        # Shuffle to prevent repeating update cycles
        X, Y = shuffle(features, outputs)
        for ind, x in enumerate(X):
            ascent = calculate_cost_gradient(weights, x, Y[ind])
            weights -= learning_rate * ascent
        # Convergence check on the 2^nth epoch (and on the final epoch)
        if epoch == 2 ** nth or epoch == max_epochs - 1:
            cost = compute_cost(weights, features, outputs)
            print(f"Epoch is: {epoch} and Cost is: {cost}")
            # Stoppage criterion: relative change in cost is small enough
            if abs(prev_cost - cost) < cost_threshold * prev_cost:
                return weights
            prev_cost = cost
            nth += 1
    return weights
The update rule is applied once for each sample in the dataset, and the process is repeated for multiple epochs or until convergence. Convergence is checked by observing the relative change in cost between checkpoints (here, at epochs 1, 2, 4, 8, ...); if the change is very small (below a set threshold), the algorithm can be stopped early.
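Before running this on the real data, a tiny smoke test on made-up, linearly separable points can confirm that the pieces fit together. The values below are arbitrary, the labels are encoded as +1/-1 as the hinge loss assumes, and a constant column stands in for the intercept.

# Made-up, linearly separable 2-D points with an added intercept column.
toy_X = np.array([
    [2.0, 2.0, 1.0],
    [3.0, 2.5, 1.0],
    [2.5, 3.0, 1.0],
    [-2.0, -2.0, 1.0],
    [-3.0, -2.5, 1.0],
    [-2.5, -3.0, 1.0],
])
toy_y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

toy_W = sgd(toy_X, toy_y, learning_rate=0.01, max_epochs=100)
print(f"Toy weights: {toy_W}")
print(f"Toy training accuracy: {np.mean(np.sign(toy_X @ toy_W) == toy_y)}")

The learned weights should separate the two toy clusters; their exact values depend on the shuffling.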
Data Preprocessing
Load Data
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset; the last column holds the class label
data = pd.read_csv('data_banknote_authentication.csv')
Y = data.iloc[:, -1]
X = data.iloc[:, 1:4]

# Add a constant column so the bias is learned as part of the weight vector
X.insert(loc=len(X.columns), column='intercept', value=1)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
Training and Testing the Model
Using Our Implementation
from sklearn.metrics import accuracy_score

# Train the model
print("Training started...")
W = sgd(X_train.to_numpy(), y_train.to_numpy())
print("Training finished.")
print(f"Weights are: {W}")

# Test the model on the test set
y_test_predicted = np.sign(X_test.to_numpy() @ W)
print(f"Accuracy on test dataset: {accuracy_score(y_test.to_numpy(), y_test_predicted)}")
The accuracy on the test dataset is 0.31.
Using Sklearn
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Train the model
print("Training started...")
model = SVC()
model.fit(X_train, y_train)
print("Training finished.")
# Test the model on the test set
accuracy = model.score(X_test, y_test)
print(f"Accuracy on the test dataset: {accuracy}")
The accuracy on the test dataset is 0.90.
Discussion
Scikit-learn's SVC implementation performs significantly better than the manually implemented SGD. There are several possible reasons for this:
- Algorithm Implementation: Scikit-learn is a well-maintained library that has been refined over the years. Its algorithms are optimized for performance and accuracy and have undergone thorough testing, so its implementation likely includes optimizations and refinements not present in the manual implementation.
- Kernel Trick: The manually implemented SGD trains a linear SVM, which assumes a linear decision boundary and may not fit all datasets well. By default, Scikit-learn's SVC uses the radial basis function (RBF) kernel, which can capture non-linear patterns in the data. This can be a significant factor if the data is not linearly separable.
- Hyperparameters: The manually implemented SGD uses fixed hyperparameters, such as the learning rate, which may not be optimal for the data. SVC, on the other hand, exposes multiple hyperparameters with sensible defaults, and they can be tuned further with techniques like grid search (see the sketch after this list).
- Regularization: Scikit-learn's SVC includes regularization to prevent overfitting, which may not be adequately addressed in the manual implementation.
- Convergence Criteria: The manual SGD checks convergence based on a fixed number of epochs and a cost threshold. If the algorithm has not converged by the final epoch, it may not return an optimal solution. Scikit-learn's implementations typically use more sophisticated convergence checks.
- Data Preprocessing: SVMs are, in general, sensitive to feature scaling. If the data was not scaled before being fed to the manual SGD but was scaled for the Scikit-learn SVM (or vice versa), this could significantly impact performance.
- Numerical Stability: There may be aspects of numerical stability and optimization that Scikit-learn handles better, leading to more robust learning.
- Implementation Errors: There might be subtle bugs or inaccuracies in the manually implemented SGD that affect its performance.
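To illustrate the scaling and hyperparameter points above, the following is a minimal sketch (not part of the original assignment): it standardizes the features and grid-searches SVC's kernel, C, and gamma with cross-validation on the training split. The grid values are arbitrary choices for illustration.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Standardize the features, then search a small, arbitrary hyperparameter grid
# with 5-fold cross-validation on the training split. (The constant intercept
# column becomes zeros after scaling, which is harmless since SVC fits its own
# intercept.)
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': ['scale', 0.01, 0.1, 1],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Accuracy on the test dataset: {search.score(X_test, y_test)}")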