Homework 6. Resampling Methods: SECOM
Introduction
This assignment is based on the SECOM dataset from the UCI Machine Learning Repository. The dataset contains 1567 examples of process measurements recorded by sensors during the manufacturing of semiconductor products. The goal is to predict whether or not a product is likely to fail quality control (binary classification).
Data Preprocessing
Load Data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('Secom.csv')
Handle Missing Data
# Replace every '!' placeholder (possibly surrounded by spaces) with NaN
data.replace(to_replace=r'\s*!\s*', value=np.nan, regex=True, inplace=True)
Fill Missing Values
data.fillna(0, inplace=True)
Check if there are any missing values left:
assert data.isnull().sum().sum() == 0
Split Data
X = data.drop(['Target'], axis=1)
y = data.Target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
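Because the Discussion points to class imbalance, it may be worth checking how a plain random split distributes the rare class. The sketch below uses synthetic labels (7% positives, an arbitrary illustrative rate, not the SECOM rate) and the `stratify` parameter of `train_test_split`; every `*_demo`, `*_plain`, and `*_strat` name is made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 70 positives (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 70 + [0] * 930)

# Plain random split: the positive share in the test set can drift.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

# Stratified split: each part keeps the overall positive share.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

print("plain test positives:     ", int(y_te_plain.sum()), "of", len(y_te_plain))
print("stratified test positives:", int(y_te_strat.sum()), "of", len(y_te_strat))
```

On the real data, the equivalent call would pass `stratify=y` to the split above; whether that changes the reported AUC is not something this sketch can show.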
Logistic Regression
Fit and Predict
from sklearn.linear_model import LogisticRegression
# Create logistic regression object
regr = LogisticRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
Plot ROC Curve
import matplotlib.pyplot as plt
from sklearn import metrics
%matplotlib inline
# calculate the fpr and tpr for all thresholds of the classification
probs = regr.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)
print("AUC: {}".format(roc_auc))
# Plot the ROC curve
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
The result is an AUC of 0.55. This is a poor result, as a random classifier would have an AUC of 0.5. Let's see how resampling methods can improve this result.
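To make the 0.5 baseline concrete: AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. Below is a minimal pure-Python sketch of that pairwise definition. `pairwise_auc` and the toy labels and scores are illustrative, not part of the assignment code.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))        # 0.75

# A model that gives every example the same score cannot rank anything.
print(pairwise_auc(toy_labels, [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

Read this way, an AUC of 0.55 means the model orders a random failing/passing pair correctly only slightly more often than a coin flip.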
Resampling Methods
Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
regr = RFE(LogisticRegression(), n_features_to_select=40, step=1)
regr.fit(X_train, y_train)
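With 590 sensor features and step=1, RFE drops only one feature per round, so the model is refit roughly 550 times before 40 features remain. This is not an efficient way to select features, and the code takes a long time to run.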
The AUC is 0.77.
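The long runtime noted above follows from simple arithmetic. Assuming RFE refits the estimator once per elimination round and once more on the final feature set (my reading of scikit-learn's behavior, stated here as an assumption), the number of model fits can be estimated as follows. `rfe_fit_count` is a hypothetical helper, not a scikit-learn function.

```python
import math

def rfe_fit_count(n_features, n_features_to_select, step):
    """Estimate how many times RFE fits the model (assumed loop structure)."""
    # One fit per elimination round; each round removes up to `step` features.
    eliminations = math.ceil((n_features - n_features_to_select) / step)
    # Plus one final fit on the selected features.
    return eliminations + 1

for step in (1, 10, 50):
    print("step={:>2}: about {} fits".format(step, rfe_fit_count(590, 40, step)))
```

Under that assumption, a larger `step` cuts the number of fits sharply, at the price of a coarser elimination path. Whether the selected features or the AUC change is something only a rerun on the data would show.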
Stratified K-Fold Cross-Validation
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.to_numpy()[train_index], X.to_numpy()[test_index]
    y_train, y_test = y.to_numpy()[train_index], y.to_numpy()[test_index]
    regr = LogisticRegression(max_iter=5000)
    regr.fit(X_train, y_train)
    # calculate the fpr and tpr for all thresholds of the classification
    probs = regr.predict_proba(X_test)
    preds = probs[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
    roc_auc = metrics.auc(fpr, tpr)
    print("AUC: {}".format(roc_auc))
# Plot the ROC curve of the last fold
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
The AUC hovers around 0.6. This is also a poor result.
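A more compact way to get the per-fold AUCs and summarize them is `cross_val_score` with `scoring="roc_auc"`. The sketch below runs on synthetic imbalanced data from `make_classification`, so its numbers say nothing about SECOM; the `*_demo` names and the 90/10 class weights are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 500 rows, 20 features, roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0)

skf_demo = StratifiedKFold(n_splits=5)

# One AUC per fold; each fold keeps roughly the same class ratio.
auc_scores = cross_val_score(
    LogisticRegression(max_iter=5000), X_demo, y_demo,
    cv=skf_demo, scoring="roc_auc")

print("AUC per fold:", auc_scores.round(3))
print("Mean AUC: {:.3f} +/- {:.3f}".format(auc_scores.mean(), auc_scores.std()))
```

Reporting the mean together with the spread makes it easier to judge whether a single-split figure lies within the fold-to-fold variation.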
Discussion
The SECOM dataset contains sensor readings recorded during semiconductor manufacturing, with the goal of predicting product quality. After initial preprocessing, which included handling missing values, a Logistic Regression model was applied, resulting in an AUC of 0.55, barely better than a random guess.
To enhance model performance, Recursive Feature Elimination (RFE) was used to pinpoint the top 40 features from the initial 590. This approach, despite its computational cost, raised the AUC to 0.77. However, Stratified K-Fold Cross-Validation, which preserves the class ratio in every fold and was intended to address class imbalance, only achieved an AUC of around 0.6 with Logistic Regression on all features.
In summary, while RFE showed promise, other techniques or models may be needed to further refine predictions on the SECOM dataset.