Homework 8. Dimension Reduction Methods: MNIST

Introduction

This assignment is about dimension reduction methods. We will be using the MNIST dataset, which contains 70,000 images of handwritten digits. Each image is 28x28 pixels, and each pixel is represented by a value from 0 to 255. The goal is to reduce the dimensionality of the dataset while preserving as much information as possible.


Data Preprocessing

Load Data

from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays instead of a pandas DataFrame,
# which makes the positional slicing below unambiguous
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

Split data

X_train = mnist['data'][:60000]
y_train = mnist['target'][:60000]

X_test = mnist['data'][60000:]
y_test = mnist['target'][60000:]

Logistic Regression

Train model

from sklearn.linear_model import LogisticRegression

# note: recent scikit-learn versions deprecate multi_class (multinomial
# is already the default with the lbfgs solver), and lbfgs may need a
# higher max_iter to converge on raw pixel values without a warning
log_rgr = LogisticRegression(multi_class="multinomial", solver="lbfgs", random_state=42)
log_rgr.fit(X_train, y_train)

Evaluate model

The score method runs prediction and evaluation in one step, returning the mean accuracy on the test set.

score = log_rgr.score(X_test, y_test)
print(score)
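As a sanity check, score for a classifier is equivalent to calling predict and then accuracy_score. A minimal sketch on a small synthetic dataset (a stand-in for MNIST, just to show the equivalence):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# small synthetic multi-class dataset, illustrative only
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=10, n_classes=3,
                           random_state=42)
clf = LogisticRegression(max_iter=1000, random_state=42).fit(X, y)

# score(X, y) computes the same mean accuracy as accuracy_score
acc_via_score = clf.score(X, y)
acc_via_metric = accuracy_score(y, clf.predict(X))
print(acc_via_score, acc_via_metric)
```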

The accuracy of the model is 0.9255, but the training time is quite long. Let's see if we can improve the performance by dimension reduction.
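The training-time claim can be made concrete by wrapping fit in a timer. A sketch with synthetic data, since the exact MNIST timings depend on hardware (on the real 60,000x784 matrix the gap between raw and reduced training is what matters):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# small synthetic stand-in for the MNIST training set
X, y = make_classification(n_samples=500, n_features=50, random_state=42)

start = time.perf_counter()
LogisticRegression(max_iter=1000, random_state=42).fit(X, y)
elapsed = time.perf_counter() - start
print(f"training took {elapsed:.3f}s")
```

Timing fit on the original data and on the PCA-reduced data with the same wrapper gives a direct before/after comparison.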


Principal Component Analysis (PCA)

Fit and transform

from sklearn.decomposition import PCA

# a float n_components keeps the smallest number of components
# whose cumulative explained variance reaches that ratio (here 95%)
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
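After fitting, n_components_ and explained_variance_ratio_ report how many dimensions survived and how much variance they carry. A sketch on synthetic low-rank data (the dataset is illustrative, not MNIST):

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic data dominated by a few latent directions plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 50)) \
    + 0.1 * rng.normal(size=(500, 50))

pca = PCA(n_components=0.95)
X_red = pca.fit_transform(X)

# the retained components explain at least 95% of the variance
kept = pca.n_components_
ratio = pca.explained_variance_ratio_.sum()
print(kept, ratio)
```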

Train model

log_clf2 = LogisticRegression(multi_class="multinomial", solver="lbfgs", random_state=42)
log_clf2.fit(X_train_reduced, y_train)

Evaluate model

# reuse the projection learned on the training set: call
# transform, not fit_transform, on the test data
X_test_reduced = pca.transform(X_test)

score = log_clf2.score(X_test_reduced, y_test)
print(score)

The accuracy of the model is 0.9205, which is slightly worse than the original model. However, the training time is significantly shorter.
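The small accuracy drop reflects that PCA is lossy: the 5% of variance it discards cannot be recovered. One way to quantify what was lost is the reconstruction error after inverse_transform. A sketch on synthetic data (illustrative, not the MNIST matrices above):

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic low-rank data plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 40)) \
    + 0.2 * rng.normal(size=(300, 40))

pca = PCA(n_components=0.95)
X_red = pca.fit_transform(X)
X_back = pca.inverse_transform(X_red)  # project back to original space

# mean squared reconstruction error -- the information PCA discarded
mse = np.mean((X - X_back) ** 2)
print(mse)
```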


Discussion

In the MNIST dataset exercise, we compared the performance of logistic regression on the original dataset to its performance on data reduced via PCA. Without dimensionality reduction, the model achieved an accuracy of 0.9255 but had a longer training time. Post-PCA, training was faster, but there was a slight dip in accuracy to 0.9205. This highlights a trade-off between computational efficiency and accuracy. Real-world systems, especially those needing quick decisions or facing resource constraints, might prioritize efficiency over a minor accuracy dip.

Future work could explore other dimensionality reduction methods, hyperparameter tuning, and ensemble techniques to find an optimal balance for specific needs.
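One inexpensive variant to try is PCA with a randomized SVD solver, which approximates the top components much faster than a full SVD on large matrices. A sketch on synthetic data (the n_components=20 choice here is arbitrary, for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# synthetic stand-in for a large, wide dataset
X, _ = make_classification(n_samples=500, n_features=100, random_state=42)

# randomized SVD trades a small approximation error for speed
rnd_pca = PCA(n_components=20, svd_solver="randomized", random_state=42)
X_red = rnd_pca.fit_transform(X)
print(X_red.shape)  # (500, 20)
```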