Homework 5. Model Building Part 2: Predictive Maintenance

Introduction

In this assignment, we use the AI4I 2020 Predictive Maintenance dataset from the UCI Machine Learning Repository. The dataset contains 10,000 data points with 14 features each. The goal is to predict whether a milling machine fails (the binary 'Machine failure' column) based on the remaining features.


Data Preprocessing

Load the data

import pandas as pd

ai4i2020 = pd.read_csv('ai4i2020.csv')

Convert columns to numeric

ai4i2020['Air temperature [K]'] = pd.to_numeric(ai4i2020['Air temperature [K]'])
ai4i2020['Process temperature [K]'] = pd.to_numeric(ai4i2020['Process temperature [K]'])

Replace missing values with mean

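# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and which may not update the DataFrame
ai4i2020['Air temperature [K]'] = ai4i2020['Air temperature [K]'].fillna(ai4i2020['Air temperature [K]'].mean())
ai4i2020['Process temperature [K]'] = ai4i2020['Process temperature [K]'].fillna(ai4i2020['Process temperature [K]'].mean())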

Drop unneeded columns

ai4i2020.drop(['UDI', 'Product ID'],axis=1, inplace=True)

Test-train split

from sklearn.model_selection import train_test_split
X = ai4i2020.drop(['Machine failure'],axis=1)
y = ai4i2020['Machine failure']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
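
Because failures are rare, a purely random split can leave the train and test sets with slightly different failure rates. The sketch below shows one possible safeguard, the stratify argument of train_test_split, on synthetic labels. The array names and the 3% failure rate are assumptions for illustration and are not taken from the dataset or from the split used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 3% positives; NOT the ai4i2020 data.
y_demo = np.array([1] * 30 + [0] * 970)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class ratio (approximately) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=42
)
print(y_tr.mean(), y_te.mean())  # both should be close to 0.03
```

The results reported in this assignment were produced without stratification, so switching it on would change the numbers somewhat.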

One-hot encoding

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train)
X_train_enc = enc.transform(X_train)
X_test_enc = enc.transform(X_test)
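
Note that enc.fit(X_train) above one-hot encodes every column, numeric ones included, so each distinct sensor reading becomes its own category, and with handle_unknown='ignore' any reading not seen during training is encoded as all zeros. A possible alternative, sketched below on a toy frame, is to encode only the categorical column and scale the numeric ones with a ColumnTransformer. The toy values, the variable names, and the assumption that 'Type' is the only categorical column are illustrative and were not used to produce the results in this assignment.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames with made-up values; only the column names mirror the dataset.
X_train_demo = pd.DataFrame({
    'Type': ['L', 'M', 'H', 'L'],
    'Air temperature [K]': [298.1, 298.2, 298.9, 299.3],
    'Torque [Nm]': [42.8, 46.3, 29.1, 51.0],
})
X_test_demo = pd.DataFrame({
    'Type': ['M', 'L'],
    'Air temperature [K]': [300.4, 297.5],  # numeric values not seen in training
    'Torque [Nm]': [38.2, 60.7],
})

categorical_cols = ['Type']  # assumption: the only categorical column
numeric_cols = [c for c in X_train_demo.columns if c not in categorical_cols]

preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('num', StandardScaler(), numeric_cols),
])

# Fit on the training split only, then apply the same transform to the test split
X_train_demo_enc = preprocess.fit_transform(X_train_demo)
X_test_demo_enc = preprocess.transform(X_test_demo)
print(X_train_demo_enc.shape, X_test_demo_enc.shape)
```

With this setup, unseen numeric readings in the test set still carry information, because they are scaled rather than looked up as categories.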

Handle class imbalance

Here, we use SMOTE to oversample the minority class.

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train_enc, y_train)
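
To make the idea behind SMOTE concrete, here is a simplified NumPy sketch of its core step: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function name smote_like_samples and the toy points are hypothetical. This is an illustration of the principle, not imblearn's implementation.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Simplified SMOTE-style interpolation (illustrative, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # pick a minority sample
        j = neighbours[i, rng.integers(k)]  # pick one of its k nearest neighbours
        gap = rng.random()                  # position along the segment, in [0, 1)
        new_points.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new_points)

# Toy minority-class points (made up)
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_samples(X_minority, n_new=6)
print(synthetic.shape)
```

Because every synthetic point is an interpolation between two existing minority samples, oversampling adds no information beyond what the original minority samples already contain.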

Model building

Here we train five different models on the resampled training data and compare their performance on the test set.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


models = {'Logistic Regression': LogisticRegression(),
          'Support Vector Machine': SVC(),
          'K-NN': KNeighborsClassifier(),
          'Decision Tree': DecisionTreeClassifier(),
          'XGBoost': XGBClassifier()}

# Fit each model on the SMOTE-resampled training set and evaluate it on the untouched test set
for name, model in models.items():
    model.fit(X_res, y_res)

    y_pred = model.predict(X_test_enc)
    print(name)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

See the results below:

Classifier              Class     Precision  Recall  F1-Score  Support
Logistic Regression     0         0.98       0.96    0.97      2907
                        1         0.21       0.31    0.25      93
                        Accuracy                     0.94      3000
Support Vector Machine  0         0.97       1.00    0.98      2907
                        1         0.33       0.03    0.06      93
                        Accuracy                     0.97      3000
K-NN                    0         0.98       0.52    0.68      2907
                        1         0.04       0.69    0.08      93
                        Accuracy                     0.53      3000
Decision Tree           0         0.97       0.98    0.98      2907
                        1         0.11       0.06    0.08      93
                        Accuracy                     0.95      3000
XGBoost                 0         0.97       1.00    0.98      2907
                        1         0.29       0.02    0.04      93
                        Accuracy                     0.97      3000

Observations

  • Class imbalance:

    • The accuracy of most classifiers (all except K-NN) is high, between 94% and 97%, which at first glance suggests that these models are doing a good job. That figure is misleading, though: class 1 accounts for only 93 of the 3000 test samples, so a model that always predicts class 0 would already reach about 96.9% accuracy.

    • When we inspect the class-specific metrics, those for class 0 are high for every classifier except K-NN, while those for class 1 are consistently low. Precision on class 1 never exceeds 0.33, recall is below 0.35 for every classifier except K-NN, and the F1-score never exceeds 0.25.

    • Note that SMOTE balances only the training set; the test set is left unchanged, which is why the support for class 1 is still 93.

  • Which model is better?

    • If catching as many class 1 cases as possible is the priority, even at the cost of more false positives, K-NN might be considered better because of its high recall (0.69). That comes at a steep price, though: its precision on class 1 is only 0.04, and it misclassifies nearly half of the class 0 samples (class 0 recall of 0.52).

    • If a balance between predicting class 1 correctly and not having too many false positives is needed, Logistic Regression appears to be the best compromise, with the highest class 1 F1-score (0.25).

  • How to improve?

    • Resampling: We already oversample the minority class (class 1) with SMOTE. Undersampling the majority class, alone or combined with oversampling, is another way to create a balanced training set.
    • Change Evaluation Metric: Instead of accuracy, focus on metrics like F1-Score, Precision, Recall, or AUC-ROC which give a better idea about the performance on imbalanced datasets.
    • Class Weights: Assign higher weights to the minority class. Many algorithms, including logistic regression, allow you to set class weights.
    • Ensemble Techniques: Techniques like bagging and boosting can improve performance by combining multiple models.
    • Anomaly Detection: If class 1 is rare and represents anomalies, anomaly detection techniques might be more appropriate than standard classification approaches.
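
The class-weight and evaluation-metric suggestions above can be sketched together as follows. The example uses synthetic data from make_classification with roughly 3% positives, not the ai4i2020 data, and all variable names are hypothetical. It compares an unweighted logistic regression with one using class_weight='balanced', and reports class 1 recall and average precision instead of accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 3% positives; NOT the ai4i2020 data.
X_demo, y_demo = make_classification(n_samples=5000, n_features=10, n_informative=5,
                                     weights=[0.97, 0.03], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          stratify=y_demo, random_state=42)

results = {}
for label, weight in [('unweighted', None), ('balanced', 'balanced')]:
    # class_weight='balanced' weights each class inversely to its frequency
    clf = LogisticRegression(class_weight=weight, max_iter=1000)
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]  # predicted probability of class 1
    results[label] = {
        'recall_class_1': recall_score(y_te, clf.predict(X_te)),
        'average_precision': average_precision_score(y_te, scores),
    }
    print(label, results[label])
```

Class weighting is an alternative to resampling rather than an addition to it: if it were tried on this assignment, the models would be fit on the original (non-SMOTE) training data.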

Feature Selection

Here we use Recursive Feature Elimination (RFE) with logistic regression as the base estimator to select the top 3 features.

from sklearn.feature_selection import RFE

log_rgr = LogisticRegression(random_state=5, max_iter=500)

rfe = RFE(log_rgr, n_features_to_select=3)
rfe.fit(X_res, y_res)

y_pred = rfe.predict(X_test_enc)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Here are the results:

Class  Precision  Recall  F1-Score  Support
0      0.97       1.00    0.98      2907
1      0.62       0.05    0.10      93

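The performance on class 1 is still quite poor. Compared with logistic regression on all features, precision rises from 0.21 to 0.62, but recall drops from 0.31 to 0.05 and the F1-score from 0.25 to 0.10. This suggests that the three selected features alone are not sufficient to predict class 1.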
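
To see which features RFE actually kept, the fitted selector exposes a boolean mask (support_) and a ranking (ranking_, where 1 marks a selected feature). The sketch below demonstrates this on synthetic data with hypothetical feature names; it does not reproduce the selection made above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the feature names are hypothetical.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=5)
feature_names = [f'feature_{i}' for i in range(X_demo.shape[1])]

rfe_demo = RFE(LogisticRegression(random_state=5, max_iter=500), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

# support_ is True for kept features; ranking_ is 1 for kept features, higher for dropped ones
selected = [name for name, keep in zip(feature_names, rfe_demo.support_) if keep]
print(selected)
print(dict(zip(feature_names, rfe_demo.ranking_)))
```

In the pipeline above, the same attributes are available on rfe, and the encoded column names could presumably be recovered with enc.get_feature_names_out(), assuming a scikit-learn version that provides it.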