Homework 1: Multi-class Classification using the Avila Dataset

Introduction

This assignment uses the Avila dataset from the UCI Machine Learning Repository. The Avila Bible is a Latin copy of the entire Bible produced in the 12th century and transcribed by 12 different copyists. The dataset provides features extracted from 800 images of the manuscript.

Each sample is described by 10 continuous features, and the objective is to identify the copyist by classifying the sample into one of 12 possible classes.

Data Preprocessing

Loading the Data

import pandas as pd

train = pd.read_csv("avila-tr.txt", header=None)
test = pd.read_csv("avila-ts.txt", header=None)

x_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
x_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]

Z-Normalization

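To put all features on a common scale, we z-normalize the data. The scaler is fit on the training set only, and the same transformation is then applied to the test set.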

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
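
As a rough illustration of what the scaler computes, the sketch below applies z-normalization by hand to a made-up toy matrix and compares the result with StandardScaler. The `toy_*` and `manual_*` names and all values are assumptions chosen for the example, not part of the assignment data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test matrices (made-up values): 3 samples x 2 features.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# Statistics come from the training set only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# z = (x - mean) / std, applied with the same statistics to both sets.
manual_train = (toy_train - mean) / std
manual_test = (toy_test - mean) / std

# The hand-computed values should agree with StandardScaler.
toy_scaler = StandardScaler().fit(toy_train)
print(np.allclose(manual_train, toy_scaler.transform(toy_train)))
print(np.allclose(manual_test, toy_scaler.transform(toy_test)))
```

After the transformation each training feature has mean 0 and standard deviation 1. The test set generally does not, because it is scaled with the training statistics.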

Logistic Regression for Multi-class Classification

One-vs-All (OvA) Strategy

In the OvA strategy, one binary logistic regression model is trained per class to separate that class from all the others. At prediction time, the class whose model outputs the highest probability is chosen as the final prediction.

from sklearn.linear_model import LogisticRegression

def train_ova(x, y):
    """
    Train a multiclass classifier using the OvA strategy. 
    """
    # y.unique() returns a NumPy array, which has no sort_values() method.
    labels = sorted(y.unique())
    print(f"Number of classes: {len(labels)}")

    models = []
    for label in labels:
        print(f"Training Logistic Regression model for class {label}")
        binary_y = (y == label).astype('int')

        model = LogisticRegression().fit(x, binary_y)
        models.append(model)
    return models, labels

import sys
import numpy as np

def predict_ova(models, labels, x):
    """
    Predict multiclass labels using the OvA strategy.
    """
    if not models:
        # Raise instead of exiting so the caller can handle the error.
        raise ValueError("No trained models. Please call train_ova() first.")

    probabilities = [model.predict_proba(x)[:, 1] for model in models]
    predicted_indices = np.argmax(probabilities, axis=0)
    return [labels[i] for i in predicted_indices]

models, labels = train_ova(x_train, y_train)
predictions_ova = predict_ova(models, labels, x_test)

from sklearn.metrics import accuracy_score, confusion_matrix

ova_accuracy = accuracy_score(y_test, predictions_ova)
print(f"Accuracy of OvA classifier: {ova_accuracy:.3f}")
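
To make the argmax step in predict_ova concrete, here is a minimal sketch with made-up numbers. The three labels and the probability values are hypothetical; only the stacking and argmax logic mirrors the function above.

```python
import numpy as np

toy_labels = ["A", "B", "C"]  # hypothetical class labels

# Row k holds model k's estimate of P(class k vs. rest) for four samples.
toy_probabilities = np.array([
    [0.9, 0.2, 0.4, 0.1],  # "A" vs. rest
    [0.3, 0.7, 0.5, 0.2],  # "B" vs. rest
    [0.1, 0.6, 0.3, 0.8],  # "C" vs. rest
])

# For each sample (column), pick the model with the highest probability.
winning_model = np.argmax(toy_probabilities, axis=0)
toy_predictions = [toy_labels[k] for k in winning_model]
print(toy_predictions)  # ['A', 'B', 'B', 'C']
```

Because the binary models are trained independently, the probabilities in one column need not sum to 1. Only their ranking matters for the prediction.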

One-vs-One (OvO) Strategy

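For OvO, a binary logistic regression model is trained for every pair of classes, using only the samples that belong to those two classes. With 12 classes this gives 12 × 11 / 2 = 66 models. At prediction time each model votes for one of its two classes, and the class with the most votes is chosen.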
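
The pairwise scheme can be sketched on its own before looking at the training code. The snippet below counts the class pairs for 12 classes and tallies votes for a single sample. The integer class ids, the three-class example, and the `pairwise_winners` outcomes are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Every unordered pair of classes gets its own binary model.
n_classes = 12
class_ids = range(n_classes)  # stand-ins for the 12 copyist labels
pairs = list(combinations(class_ids, 2))
print(len(pairs))  # 66

# Toy vote for one sample with three classes: each pairwise model
# names the class it prefers (hypothetical outcomes).
pairwise_winners = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "C"}
tally = Counter(pairwise_winners.values())
predicted = tally.most_common(1)[0][0]
print(predicted)  # C
```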

def train_ovo(x, y):
    """
    Train a multiclass classifier using the OvO strategy.
    """
    # y.unique() returns a NumPy array, which has no sort_values() method.
    labels = sorted(y.unique())
    models, label_pairs = [], []

    for i, label_i in enumerate(labels):
        for label_j in labels[i+1:]:
            print(f"Training model to distinguish {label_i} and {label_j}")
            subset = y.isin([label_i, label_j])
            subset_x, subset_y = x[subset], y[subset]

            model = LogisticRegression(solver='liblinear').fit(subset_x, subset_y)
            models.append(model)
            label_pairs.append((label_i, label_j))
    return models, label_pairs

def predict_ovo(models, label_pairs, x):
    """
    Predict multiclass labels using the OvO strategy.
    """
    if not models:
        # Raise instead of exiting so the caller can handle the error.
        raise ValueError("No trained models. Please call train_ovo() first.")

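    # One vote-count column per class, one row per sample.
    labels = sorted({label for pair in label_pairs for label in pair})
    votes = pd.DataFrame(0, index=x.index, columns=labels)
    for model, (label_i, label_j) in zip(models, label_pairs):
        predictions = model.predict(x)
        # Accumulate votes; plain assignment would keep only the last model's vote.
        votes[label_i] += (predictions == label_i).astype(int)
        votes[label_j] += (predictions == label_j).astype(int)

    # Ties go to the first class in sorted order.
    return votes.idxmax(axis=1)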

models, label_pairs = train_ovo(x_train, y_train)
predictions_ovo = predict_ovo(models, label_pairs, x_test)
ovo_accuracy = accuracy_score(y_test, predictions_ovo)
print(f"Accuracy of OvO classifier: {ovo_accuracy:.3f}")

Using Scikit-learn's Built-in Strategies

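Scikit-learn's LogisticRegression can handle multi-class classification directly through the multi_class='ovr' argument. One-vs-rest (OvR) is another name for the OvA approach. Note that the multi_class argument is deprecated as of scikit-learn 1.5, where wrapping the estimator in sklearn.multiclass.OneVsRestClassifier is the suggested replacement.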

clf_ovr = LogisticRegression(multi_class='ovr').fit(x_train, y_train)
predictions_ovr = clf_ovr.predict(x_test)
ovr_accuracy = accuracy_score(y_test, predictions_ovr)
print(f"Accuracy of OvR classifier: {ovr_accuracy:.3f}")
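
Scikit-learn also ships generic wrappers for both strategies in sklearn.multiclass. The sketch below uses synthetic 4-class data from make_classification as a stand-in for the Avila features. The data, the `*_demo` names, and the max_iter value are assumptions for illustration, so the accuracies it would produce say nothing about the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class data standing in for the Avila features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Each wrapper trains the binary models and combines their outputs.
ovr_demo = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)
ovo_demo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

print(len(ovr_demo.estimators_))  # 4: one model per class
print(len(ovo_demo.estimators_))  # 6: one model per pair of classes
```

The wrappers accept any binary classifier, so the same pattern would apply to estimators other than logistic regression.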

The multinomial approach trains a single model with a softmax output that predicts the probabilities of all classes jointly. It requires fitting only one model, rather than the 12 binary models of OvA or the 66 of OvO.

clf_multinomial = LogisticRegression(max_iter=1000, multi_class='multinomial').fit(x_train, y_train)
predictions_multinomial = clf_multinomial.predict(x_test)
multinomial_accuracy = accuracy_score(y_test, predictions_multinomial)
print(f"Accuracy of Multinomial classifier: {multinomial_accuracy:.3f}")
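
For intuition about how a multinomial model turns per-class scores into probabilities, here is a minimal NumPy sketch of the softmax function. The `softmax` helper and the toy scores are written for this example and are not taken from scikit-learn's internals.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Made-up linear scores for two samples and three classes.
toy_scores = np.array([[2.0, 1.0, 0.1],
                       [0.5, 0.5, 3.0]])
toy_probs = softmax(toy_scores)

print(toy_probs.sum(axis=1))     # each row sums to 1
print(toy_probs.argmax(axis=1))  # [0 2]
```

Unlike the independent OvA probabilities, these values form a proper distribution over the classes for each sample.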

Results

Strategy	Test accuracy
OvA (hand-written)	0.497
OvO (hand-written)	0.571
OvR (scikit-learn)	0.531
Multinomial (scikit-learn)	0.561

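The OvO strategy achieves the highest test accuracy (0.571), followed by the multinomial approach (0.561) and scikit-learn's OvR (0.531). The hand-written OvA classifier scores lowest (0.497). Since OvA and OvR are the same strategy, the gap between them reflects implementation differences rather than the strategy itself.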
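
Overall accuracy does not show which classes are confused with each other. One way to look closer would be a confusion matrix, sketched here on made-up labels. The `toy_true` and `toy_pred` lists are hypothetical and unrelated to the reported numbers.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six samples.
toy_true = ["A", "A", "B", "B", "C", "C"]
toy_pred = ["A", "B", "B", "B", "C", "A"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_true, toy_pred, labels=["A", "B", "C"])
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# Per-class recall: correct predictions divided by true samples per class.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall)  # [0.5 1.  0.5]
```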