Homework 8. Dimension Reduction Methods: MNIST
Introduction
This assignment is about dimension reduction methods. We will be using the MNIST dataset, which contains 70,000 images of handwritten digits. Each image is 28x28 pixels, and each pixel is represented by a value from 0 to 255. The goal is to reduce the dimensionality of the dataset while preserving as much information as possible.
Data Preprocessing
Load Data
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', as_frame=False)  # as_frame=False returns NumPy arrays rather than a DataFrame
Split data
X_train = mnist['data'][:60000]
y_train = mnist['target'][:60000]
X_test = mnist['data'][60000:]
y_test = mnist['target'][60000:]
Logistic Regression
Train model
from sklearn.linear_model import LogisticRegression
log_rgr = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000, random_state=42)  # raise max_iter so lbfgs can converge on MNIST
log_rgr.fit(X_train, y_train)
Evaluate model
We use the score method, which performs prediction and evaluation in one step.
score = log_rgr.score(X_test, y_test)
print(score)
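The training-time claim can be checked directly by wrapping the fit in a timer. A minimal sketch follows; to keep it quick to run, it uses scikit-learn's small bundled digits dataset (1,797 8x8 images) as a stand-in for the full MNIST data, which is an assumption of this example, not part of the original assignment.

```python
import time
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# small stand-in dataset so the timing sketch runs in seconds, not minutes
X, y = load_digits(return_X_y=True)

clf = LogisticRegression(max_iter=1000, random_state=42)
start = time.perf_counter()
clf.fit(X, y)
elapsed = time.perf_counter() - start

print(f"training took {elapsed:.2f}s, training accuracy {clf.score(X, y):.3f}")
```

On the full 60,000 x 784 MNIST training set the same pattern applies, only the elapsed time grows large enough to motivate dimension reduction.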
The accuracy of the model is 0.9255, but the training time is quite long. Let's see whether dimension reduction can cut the training time without giving up much accuracy.
Principal Component Analysis (PCA)
Fit and transform
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # a float in (0, 1) keeps the smallest number of components explaining at least 95% of the variance
X_train_reduced = pca.fit_transform(X_train)
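When `n_components` is a float, PCA itself picks how many components to keep, and the chosen count is exposed as `n_components_`. The sketch below illustrates this on scikit-learn's bundled digits dataset (a 64-dimensional stand-in for the 784-dimensional MNIST images, used here only so the example is self-contained).

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 64 features per image

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# number of components actually kept vs. the original dimensionality
print(pca.n_components_, "of", X.shape[1], "dimensions kept")
# total fraction of variance those components explain (>= 0.95 by construction)
print(pca.explained_variance_ratio_.sum())
```

On MNIST the same call typically keeps on the order of 150 of the 784 pixel dimensions, which is what makes the downstream logistic regression much faster to train.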
Train model
log_clf2 = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000, random_state=42)
log_clf2.fit(X_train_reduced, y_train)
Evaluate model
X_test_reduced = pca.transform(X_test)
score = log_clf2.score(X_test_reduced, y_test)
print(score)
The accuracy of the model is 0.9205, which is slightly worse than the original model. However, the training time is significantly shorter.
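The small accuracy drop reflects the 5% of variance PCA discards. One way to see how little information is lost is to map the compressed data back to pixel space with `inverse_transform` and measure the reconstruction error. A sketch, again using the bundled digits dataset as a stand-in for MNIST:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# project the compressed representation back into the original pixel space
X_recovered = pca.inverse_transform(X_reduced)

# mean squared reconstruction error; small because 95% of variance is retained
mse = np.mean((X - X_recovered) ** 2)
print(mse)
```

The reconstruction error is bounded by the variance of the discarded components, which is consistent with the modest accuracy dip observed above.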
Discussion
In the MNIST dataset exercise, we compared the performance of logistic regression on the original dataset to its performance on data reduced via PCA. Without dimensionality reduction, the model achieved an accuracy of 0.9255 but had a longer training time. Post-PCA, training was faster, but there was a slight dip in accuracy to 0.9205. This highlights a trade-off between computational efficiency and accuracy. Real-world systems, especially those needing quick decisions or facing resource constraints, might prioritize efficiency over a minor accuracy dip.
Future work could delve into other dimensionality reduction methods, hyperparameter tuning, and ensemble techniques to find an optimal balance for specific needs.
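As one concrete direction among the methods mentioned, scikit-learn's PCA also supports a randomized SVD solver, which approximates only the top components and is often faster on large datasets like MNIST. A hedged sketch (the component count of 30 is an arbitrary choice for illustration, not a tuned value):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# randomized SVD estimates only the leading components instead of a full decomposition
pca_rand = PCA(n_components=30, svd_solver="randomized", random_state=42)
X_r = pca_rand.fit_transform(X)

print(X_r.shape)
```

Comparing such solver and component choices against the 0.95-variance baseline would be a natural next experiment.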