Homework 1: Multi-class Classification using the Avila Dataset
Introduction
This assignment uses the Avila dataset from the UCI Machine Learning Repository. The Avila Bible is a Latin copy of the entire Bible produced in the 12th century. Transcribing it was labor-intensive work carried out by 12 different copyists. The dataset provides features extracted from 800 images of the manuscript.
The dataset contains continuous features, and the objective is to classify each sample into one of 12 classes, one per copyist.
Data Preprocessing
Loading the Data
import pandas as pd
train = pd.read_csv("avila-tr.txt", header=None)
test = pd.read_csv("avila-ts.txt", header=None)
x_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
x_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]
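To make the feature/label split concrete, here is a minimal sketch that applies the same `iloc` slicing to a tiny in-memory CSV. The two rows are made up and use only three feature columns for brevity; they are not taken from the Avila files.

```python
import io
import pandas as pd

# Two made-up rows in the same layout: numeric features, then a class letter.
sample = io.StringIO("0.1,-0.2,0.3,A\n0.4,0.5,-0.6,B\n")
df = pd.read_csv(sample, header=None)

# Every column except the last is a feature; the last column is the label.
features, target = df.iloc[:, :-1], df.iloc[:, -1]
print(features.shape)   # (2, 3)
print(target.tolist())  # ['A', 'B']
```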
Z-Normalization
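To put all features on a common scale, we z-normalize the data. The scaler is fitted on the training set only, and the same transformation is then applied to both the training and test sets.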
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
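As a quick check of what z-normalization does, the sketch below compares StandardScaler with the formula z = (x - mean) / std computed by hand. The toy array is made up for illustration. Note that StandardScaler uses the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 4 samples, 2 features.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

# Manual z-normalization: subtract the column mean, then divide by the
# column standard deviation (population std, which StandardScaler also uses).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same transformation via scikit-learn.
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, regardless of its original range.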
Logistic Regression for Multi-class Classification
One-vs-All (OvA) Strategy
In the OvA strategy, a binary logistic regression model is trained for each class against all other classes. The class whose model predicts the highest probability is chosen as the final prediction.
from sklearn.linear_model import LogisticRegression
def train_ova(x, y):
    """
    Train a multiclass classifier using the OvA strategy.
    """
    # y.unique() returns a NumPy array, so sort it with the built-in sorted().
    labels = sorted(y.unique())
    print(f"Number of classes: {len(labels)}")
    models = []
    for label in labels:
        print(f"Training Logistic Regression model for class {label}")
        # Binary target: 1 for the current class, 0 for all others.
        binary_y = (y == label).astype('int')
        model = LogisticRegression().fit(x, binary_y)
        models.append(model)
    return models, labels
import sys
import numpy as np
def predict_ova(models, labels, x):
    """
    Predict multiclass labels using the OvA strategy.
    """
    if not models:
        raise ValueError("Model hasn't been trained yet. Please call train_ova() first.")
    # One row per class, one column per sample: P(sample belongs to that class).
    probabilities = np.array([model.predict_proba(x)[:, 1] for model in models])
    # For each sample, pick the class whose model is most confident.
    predicted_indices = np.argmax(probabilities, axis=0)
    return [labels[i] for i in predicted_indices]
models, labels = train_ova(x_train, y_train)
predictions_ova = predict_ova(models, labels, x_test)
from sklearn.metrics import accuracy_score, confusion_matrix
ova_accuracy = accuracy_score(y_test, predictions_ova)
print(f"Accuracy of OvA classifier: {ova_accuracy:.3f}")
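The decision rule used by OvA can be illustrated on a few hand-picked numbers. In this sketch the labels and probabilities are hypothetical; each row stands for the output of one "class vs. rest" model, and the columns are samples.

```python
import numpy as np

# Hypothetical class labels and per-class probabilities for 3 samples.
# Row k holds P(sample is class k) as estimated by the k-th binary model.
toy_labels = ["A", "B", "C"]
toy_probabilities = np.array([
    [0.7, 0.2, 0.1],   # model "A vs. rest"
    [0.4, 0.6, 0.3],   # model "B vs. rest"
    [0.1, 0.5, 0.8],   # model "C vs. rest"
])

# argmax over axis 0 picks, for each sample (column), the most confident model.
winners = np.argmax(toy_probabilities, axis=0)
toy_predictions = [toy_labels[i] for i in winners]
print(toy_predictions)  # ['A', 'B', 'C']
```

Because the binary models are trained independently, the probabilities in a column need not sum to 1; only their ranking matters for the prediction.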
One-vs-One (OvO) Strategy
For OvO, a binary logistic regression model is trained for every pair of classes. Each model votes for one of its two classes, and the class that collects the most votes is chosen.
def train_ovo(x, y):
    """
    Train a multiclass classifier using the OvO strategy.
    """
    # y.unique() returns a NumPy array, so sort it with the built-in sorted().
    labels = sorted(y.unique())
    models, label_pairs = [], []
    for i, label_i in enumerate(labels):
        for label_j in labels[i+1:]:
            print(f"Training model to distinguish {label_i} and {label_j}")
            # Keep only the samples that belong to one of the two classes.
            subset = y.isin([label_i, label_j])
            subset_x, subset_y = x[subset], y[subset]
            model = LogisticRegression(solver='liblinear').fit(subset_x, subset_y)
            models.append(model)
            label_pairs.append((label_i, label_j))
    return models, label_pairs
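def predict_ovo(models, label_pairs, x):
    """
    Predict multiclass labels using the OvO strategy.
    """
    if not models:
        raise ValueError("Model hasn't been trained yet. Please call train_ovo() first.")
    # Collect every class that appears in any pair, in sorted order.
    labels = sorted({label for pair in label_pairs for label in pair})
    # One row per sample, one column per class, all vote counts starting at zero.
    votes = pd.DataFrame(0, index=x.index, columns=labels)
    for model, (label_i, label_j) in zip(models, label_pairs):
        predictions = model.predict(x)
        # Accumulate votes instead of overwriting the results of earlier pairs.
        votes[label_i] += (predictions == label_i).astype(int)
        votes[label_j] += (predictions == label_j).astype(int)
    # On a tie, idxmax returns the first class with the maximal vote count.
    return votes.idxmax(axis=1)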
models, label_pairs = train_ovo(x_train, y_train)
predictions_ovo = predict_ovo(models, label_pairs, x_test)
ovo_accuracy = accuracy_score(y_test, predictions_ovo)
print(f"Accuracy of OvO classifier: {ovo_accuracy:.3f}")
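Two properties of OvO are easy to check in isolation: the number of pairwise models grows as K(K-1)/2, and the final label is a majority vote. The sketch below uses only the standard library; the pairwise winners for the single sample are hypothetical.

```python
from collections import Counter
from itertools import combinations

# With 12 classes, OvO needs one model per unordered pair of classes.
n_classes = 12
pairs = list(combinations(range(n_classes), 2))
print(len(pairs))  # 66, i.e. 12 * 11 / 2

# Hypothetical pairwise winners for one sample with three classes A, B, C.
pairwise_winners = {("A", "B"): "B", ("A", "C"): "C", ("B", "C"): "B"}

# Count how many pairwise contests each class won and take the top one.
tally = Counter(pairwise_winners.values())
print(tally.most_common(1)[0][0])  # B
```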
Using Scikit-learn's Built-in Strategies
Scikit-learn can handle multi-class classification with logistic regression directly through the multi_class='ovr' argument, which is essentially the OvA approach.
clf_ovr = LogisticRegression(multi_class='ovr').fit(x_train, y_train)
predictions_ovr = clf_ovr.predict(x_test)
ovr_accuracy = accuracy_score(y_test, predictions_ovr)
print(f"Accuracy of OvR classifier: {ovr_accuracy:.3f}")
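In scikit-learn 1.5 and later, the multi_class argument of LogisticRegression is deprecated, and the wrappers in sklearn.multiclass are an alternative way to get the same strategies. The sketch below is an illustration under stated assumptions: it runs on synthetic data from make_classification rather than the Avila files, and the names X_demo, y_demo, ovr, and ovo are introduced here only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class data, used only so the sketch runs without the Avila files.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# Wrap a binary-capable estimator in each multiclass strategy.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

# One binary model per class for OvR, one per pair of classes for OvO.
print(len(ovr.estimators_))  # 4
print(len(ovo.estimators_))  # 6
```

On the Avila data the same wrappers could be fitted with x_train and y_train in place of the synthetic arrays.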
The multinomial approach trains a single model that directly predicts the probabilities for all classes, so only one model is fitted instead of 12 binary models (OvA) or 66 (OvO).
clf_multinomial = LogisticRegression(max_iter=1000, multi_class='multinomial').fit(x_train, y_train)
predictions_multinomial = clf_multinomial.predict(x_test)
multinomial_accuracy = accuracy_score(y_test, predictions_multinomial)
print(f"Accuracy of Multinomial classifier: {multinomial_accuracy:.3f}")
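To see what "a single model for all classes" means, the following sketch checks that the predicted probabilities of a multinomial logistic regression are the softmax of its per-class scores. It uses synthetic data (an assumption, not the Avila data) and relies on scikit-learn's default behaviour of fitting a multinomial model when the lbfgs solver sees more than two classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data (an assumption; not the Avila data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

# Default solver (lbfgs) with more than two classes: one multinomial model.
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Softmax of the per-class scores, computed by hand.
scores = clf.decision_function(X_demo)
scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
softmax = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(np.allclose(softmax, clf.predict_proba(X_demo)))
print(clf.coef_.shape)  # (3, 6): one weight vector per class in a single model
```

Unlike the independent OvA models, these probabilities sum to 1 across classes for every sample.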
Results
| Strategy | Accuracy |
|---|---|
| OvA | 0.497 |
| OvO | 0.571 |
| OvR | 0.531 |
| Multinomial | 0.561 |
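The OvO strategy achieves the highest accuracy (0.571), followed by the multinomial approach (0.561) and scikit-learn's OvR (0.531). The hand-written OvA classifier scores lowest (0.497). Since OvA and OvR are the same strategy, the gap between them likely comes from implementation details rather than from the strategy itself.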
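Accuracy alone hides which classes are confused with each other. A confusion matrix gives a per-class view; the sketch below shows how to read one on hypothetical labels that are made up for illustration and are not results from this assignment.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, made up for illustration.
toy_true = ["A", "A", "B", "B", "C", "C"]
toy_pred = ["A", "B", "B", "B", "C", "A"]

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(toy_true, toy_pred, labels=["A", "B", "C"])
print(cm)

# Per-class recall: correct predictions divided by the number of true samples.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall)
```

The same call could be applied to y_test and any of the prediction vectors above to compare the strategies class by class.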