Homework 6. Resampling Methods: SECOM

Introduction

This assignment is based on the SECOM dataset from the UCI Machine Learning Repository. The dataset contains 1567 examples of process measurements recorded by sensors during the manufacturing of semiconductor products. The goal is to predict whether or not a product is likely to fail quality control (binary classification).


Data Preprocessing

Load Data

import numpy as np
import pandas as pd

data = pd.read_csv('Secom.csv')

Handle Missing Data

# Replace every cell containing '!' (possibly surrounded by spaces) with NaN
data.replace(to_replace=r'\s*!\s*', value=np.nan, regex=True, inplace=True)

Fill Missing Values

# Fill every remaining NaN with 0 (axis is not needed for a scalar fill value)
data.fillna(0, inplace=True)

Check if there are any missing values left:

assert data.isnull().sum().sum() == 0
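The cleaning steps above can be checked on a toy frame. This is a minimal sketch, assuming a hypothetical three-row frame whose column names (`s1`, `s2`) are invented for illustration and do not come from Secom.csv. It also converts the columns back to numeric, on the assumption that a column which held '!' strings was read as text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative, not from Secom.csv
toy = pd.DataFrame({"s1": ["1.5", " ! ", "2.0"], "s2": [3.0, np.nan, 4.0]})

# Cells matching the pattern become NaN
cleaned = toy.replace(to_replace=r'\s*!\s*', value=np.nan, regex=True)

# Columns that contained strings are still text; convert them to numbers
cleaned = cleaned.apply(pd.to_numeric)

# Fill the remaining gaps with 0, as in the assignment
cleaned = cleaned.fillna(0)

n_missing = int(cleaned.isnull().sum().sum())
print(cleaned)
print("Missing values left:", n_missing)
```

Filling with 0 is the simplest choice; whether 0 is a plausible value for a given sensor is a separate question that this sketch does not address.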

Split Data

from sklearn.model_selection import train_test_split

X = data.drop(['Target'], axis=1)
y = data.Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
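When one class is rare, a plain random split can leave the test set with a different failure rate than the training set. The sketch below shows the `stratify` argument of `train_test_split` on synthetic labels. The 10% positive rate and the `*_demo` names are assumptions for illustration, not properties of the SECOM data, and the assignment's own split does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 1000 rows, 5 features, 10% positives (illustrative only)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.zeros(1000, dtype=int)
y_demo[:100] = 1

# stratify=y_demo keeps the class ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print("Positive rate, train:", train_rate)
print("Positive rate, test: ", test_rate)
```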

Logistic Regression

Fit and Predict

from sklearn.linear_model import LogisticRegression

# Create the logistic regression model
regr = LogisticRegression()
regr.fit(X_train, y_train)

y_pred = regr.predict(X_test)

Plot ROC Curve

import matplotlib.pyplot as plt
from sklearn import metrics

%matplotlib inline

# calculate the fpr and tpr for all thresholds of the classification
probs = regr.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)
print("AUC: {}".format(roc_auc))

# Plot the ROC curve
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

The resulting AUC is 0.55, only marginally better than the 0.5 expected from a random classifier. Let's see whether resampling methods can improve on this.

[Figure roc1: ROC curve for the baseline logistic regression model]
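To make the 0.5 baseline concrete: the AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below checks this on four hand-made scores (the numbers are illustrative, not SECOM predictions) by comparing `roc_auc_score` with a direct count over all positive/negative pairs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative labels and scores, not taken from the SECOM model
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, scores)

# Fraction of (positive, negative) pairs where the positive is ranked higher;
# ties count as half a win
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

print("roc_auc_score:", auc)
print("pairwise rank:", pairwise)
```

Three of the four pairs are ordered correctly, so both values come out as 0.75. A model that scores at random orders about half the pairs correctly, which is where the 0.5 reference line comes from.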


Resampling Methods

Recursive Feature Elimination (RFE)

from sklearn.feature_selection import RFE

regr = RFE(LogisticRegression(), n_features_to_select=40, step=1)
regr.fit(X_train, y_train)

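With 590 sensors, this is not an efficient way to select features. With step=1, RFE drops one feature per iteration and refits the model each time, so reducing 590 features to 40 takes roughly 550 fits, and the code takes a long time to run.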

[Figure roc_rfe: ROC curve for logistic regression after RFE]

The AUC is 0.77.
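Much of RFE's run time comes from refitting the model after every elimination round, and the `step` parameter controls how many features are dropped per round. The sketch below compares `step=1` with `step=10` on a synthetic matrix. The data sizes and the `run_rfe` helper are assumptions for illustration, and a larger step may select a somewhat different feature set.

```python
import time
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the sensor matrix (sizes are illustrative)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=60, n_informative=8, random_state=0
)

def run_rfe(step):
    """Hypothetical helper: fit RFE with the given step and time it."""
    selector = RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=10, step=step)
    start = time.perf_counter()
    selector.fit(X_syn, y_syn)
    return selector, time.perf_counter() - start

slow, t_slow = run_rfe(step=1)    # about 50 elimination rounds
fast, t_fast = run_rfe(step=10)   # about 5 elimination rounds

print(f"step=1:  {t_slow:.2f}s, kept {slow.n_features_} features")
print(f"step=10: {t_fast:.2f}s, kept {fast.n_features_} features")
```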

Stratified K-Fold Cross-Validation

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
skf.get_n_splits(X, y)

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.to_numpy()[train_index], X.to_numpy()[test_index]
    y_train, y_test = y.to_numpy()[train_index], y.to_numpy()[test_index]
    regr = LogisticRegression(max_iter=5000)
    regr.fit(X_train, y_train)
    # calculate the fpr and tpr for all thresholds of the classification
    probs = regr.predict_proba(X_test)
    preds = probs[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
    roc_auc = metrics.auc(fpr, tpr)
    print("AUC: {}".format(roc_auc))

    # Plot the ROC curve
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

Across the five folds, the AUC hovers around 0.6, which is still a poor result.
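The per-fold AUCs can also be collected into a single summary rather than read off five separate plots. This is a minimal sketch using `cross_val_score` with `scoring="roc_auc"` on synthetic imbalanced data. The data, the shuffled folds, and the variable names are assumptions for illustration, not the assignment's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 10% positives), illustrative only
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0
)

# Each fold keeps roughly the same class ratio as the full data set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    LogisticRegression(max_iter=5000), X_syn, y_syn, cv=skf, scoring="roc_auc"
)

print("AUC per fold:", np.round(scores, 3))
print("Mean AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Reporting the mean and spread makes it easier to judge whether a difference between two models is larger than the fold-to-fold variation.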


Discussion

The SECOM dataset contains sensor readings recorded during semiconductor manufacturing, with the goal of predicting product quality. After initial preprocessing, which included handling missing values, a Logistic Regression model was fitted, resulting in an AUC of 0.55, barely better than a random guess.

To improve model performance, Recursive Feature Elimination (RFE) was used to select 40 of the original 590 features. Despite its computational cost, this raised the AUC to 0.77. Stratified K-Fold Cross-Validation, which preserves the class ratio in every fold, gave AUCs of only around 0.6 for the model trained on all features.
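The RFE figure comes from a single train/test split, while the K-fold figure comes from five folds, so the two are not directly comparable. One way to put them on the same footing is to run the feature selection inside each fold. The sketch below does this with a `Pipeline` on synthetic data. The data sizes, `step=5`, and the step names `select` and `clf` are assumptions for illustration, not the assignment's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic imbalanced data, illustrative only
X_syn, y_syn = make_classification(
    n_samples=400, n_features=40, n_informative=6,
    weights=[0.9, 0.1], random_state=1
)

# RFE is refit on the training part of every fold, so the held-out part
# never influences which features are kept
pipe = Pipeline([
    ("select", RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=10, step=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
aucs = cross_val_score(pipe, X_syn, y_syn, cv=cv, scoring="roc_auc")

print("AUC per fold:", aucs.round(3))
print("Mean AUC: %.3f" % aucs.mean())
```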

In summary, while RFE showed promise, other techniques or models may be needed to further refine predictions on the SECOM dataset.