Homework 5. Model Building Part 2: Predictive Maintenance
Introduction
In this assignment, we will be using the AI4I 2020 Predictive Maintenance dataset from the UCI Machine Learning Repository. The dataset contains 10,000 data points with 14 features each. The goal is to predict machine failure of a milling machine based on the given features.
Data Preprocessing
Load the data
import pandas as pd

ai4i2020 = pd.read_csv('ai4i2020.csv')
Convert columns to numeric
# Coerce any non-numeric entries to NaN so they can be imputed below
ai4i2020['Air temperature [K]'] = pd.to_numeric(ai4i2020['Air temperature [K]'], errors='coerce')
ai4i2020['Process temperature [K]'] = pd.to_numeric(ai4i2020['Process temperature [K]'], errors='coerce')
Replace missing values with mean
ai4i2020["Air temperature [K]"].fillna(ai4i2020['Air temperature [K]'].mean(), inplace=True)
ai4i2020["Process temperature [K]"].fillna(ai4i2020['Process temperature [K]'].mean(), inplace=True)
Drop unneeded columns
ai4i2020.drop(['UDI', 'Product ID'],axis=1, inplace=True)
Train-test split
from sklearn.model_selection import train_test_split
X = ai4i2020.drop(['Machine failure'],axis=1)
y = ai4i2020['Machine failure']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
One-hot encoding
Only the categorical Type column needs encoding; the numeric columns are passed through unchanged.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Encode only 'Type'; one-hot encoding the continuous columns would turn
# every distinct sensor reading into its own binary feature
enc = ColumnTransformer([('onehot', OneHotEncoder(handle_unknown='ignore'), ['Type'])],
                        remainder='passthrough')
enc.fit(X_train)
X_train_enc = enc.transform(X_train)
X_test_enc = enc.transform(X_test)
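To verify the encoding, we can inspect the transformed shape and the resulting feature names (get_feature_names_out requires scikit-learn >= 1.0):
print(X_train_enc.shape)
print(enc.get_feature_names_out())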
Handle class imbalance
Here, we use SMOTE to oversample the minority class.
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train_enc, y_train)
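To confirm the effect of resampling, we can compare the class counts before and after (exact numbers depend on the split):
from collections import Counter
print('Before SMOTE:', Counter(y_train))
print('After SMOTE: ', Counter(y_res))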
Model building
Here we train five different models and compare their performance.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
models = {'Logistic Regression': LogisticRegression(),
          'Support Vector Machine': SVC(),
          'K-NN': KNeighborsClassifier(),
          'Decision Tree': DecisionTreeClassifier(),
          'XGBoost': XGBClassifier()}

for name, model in models.items():
    model.fit(X_res, y_res)
    y_pred = model.predict(X_test_enc)
    print(name)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
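For an at-a-glance view of how each model handles the minority class, one can also print the class-1 precision, recall, and F1 per model (a small sketch reusing the fitted models dict above; these metrics default to pos_label=1):
from sklearn.metrics import precision_score, recall_score, f1_score
for name, model in models.items():
    y_pred = model.predict(X_test_enc)
    print(f'{name}: precision={precision_score(y_test, y_pred):.2f}, '
          f'recall={recall_score(y_test, y_pred):.2f}, '
          f'f1={f1_score(y_test, y_pred):.2f}')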
See the results below:
| Classifier | Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|
| Logistic Regression | 0 | 0.98 | 0.96 | 0.97 | 2907 |
| | 1 | 0.21 | 0.31 | 0.25 | 93 |
| | Accuracy | | | 0.94 | 3000 |
| Support Vector Machine | 0 | 0.97 | 1.00 | 0.98 | 2907 |
| | 1 | 0.33 | 0.03 | 0.06 | 93 |
| | Accuracy | | | 0.97 | 3000 |
| K-NN | 0 | 0.98 | 0.52 | 0.68 | 2907 |
| | 1 | 0.04 | 0.69 | 0.08 | 93 |
| | Accuracy | | | 0.53 | 3000 |
| Decision Tree | 0 | 0.97 | 0.98 | 0.98 | 2907 |
| | 1 | 0.11 | 0.06 | 0.08 | 93 |
| | Accuracy | | | 0.95 | 3000 |
| XGBoost | 0 | 0.97 | 1.00 | 0.98 | 2907 |
| | 1 | 0.29 | 0.02 | 0.04 | 93 |
| | Accuracy | | | 0.97 | 3000 |
Observations
- Class imbalance:
  - The accuracy for most classifiers (except K-NN) is high, around 94% to 97%, so at first glance these models appear to be doing a good job.
  - However, the class-specific metrics tell a different story: metrics for class 0 are high across all classifiers, while those for class 1 are consistently lower. Recall and F1-score for class 1 are especially poor for almost all classifiers.
  - Also note that SMOTE balances only the training set; the test set is untouched, which is why the support for class 1 is still 93.
- Which model is better?
  - If correctly predicting class 1 is the priority, K-NN might be considered best because of its high recall, though this comes at the cost of many false positives.
  - If a balance between catching class 1 and limiting false positives is needed, Logistic Regression appears to be the best compromise.
- How to Improve?
  - Resampling: oversample the minority class (class 1), as we did with SMOTE, or undersample the majority class to create a balanced dataset.
  - Change Evaluation Metric: instead of accuracy, focus on metrics like F1-score, precision, recall, or AUC-ROC, which give a better picture of performance on imbalanced datasets.
  - Class Weights: assign a higher weight to the minority class; many algorithms, including logistic regression, allow you to set class weights (see the sketch after this list).
  - Ensemble Techniques: bagging and boosting can improve performance by combining multiple models.
  - Anomaly Detection: if class 1 is rare and represents anomalies, anomaly-detection techniques might be more appropriate than standard classification approaches.
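As a concrete example of the class-weights idea, here is a minimal sketch using scikit-learn's built-in 'balanced' mode (the settings are illustrative, not tuned; it reuses X_train_enc and y_train from above, with no SMOTE needed since reweighting handles the imbalance):
# 'balanced' weights classes inversely proportional to their frequencies in y
weighted_lr = LogisticRegression(class_weight='balanced', max_iter=1000)
weighted_lr.fit(X_train_enc, y_train)
print(classification_report(y_test, weighted_lr.predict(X_test_enc)))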
Feature Selection
Here we use Recursive Feature Elimination (RFE) to select the top 3 features.
from sklearn.feature_selection import RFE
log_rgr = LogisticRegression(random_state=5, max_iter=500)
rfe = RFE(log_rgr, n_features_to_select=3)
rfe.fit(X_res, y_res)
y_pred = rfe.predict(X_test_enc)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
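To see which features RFE actually kept, the support mask can be mapped back to the encoder's feature names (assuming the enc transformer defined above):
feature_names = enc.get_feature_names_out()
print('Selected features:', feature_names[rfe.support_])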
Here are the results:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.97 | 1.00 | 0.98 | 2907 |
| 1 | 0.62 | 0.05 | 0.10 | 93 |
The performance on class 1 is still quite poor. This suggests that the three selected features are not very informative for predicting class 1.