Homework 5. Ensemble Methods: Internet Advertisements

Introduction

This assignment is based on the Internet Advertisements Data Set. The dataset contains 3,279 samples, each describing a candidate image on a web page. Each sample has 1,557 features (mostly binary indicators, plus a few continuous geometry features), and the objective is to classify each sample as an advertisement or not.


Data Preprocessing

Load the Data

import numpy as np
import pandas as pd

internetAd = pd.read_csv('Internet_Ad_Data.csv', sep=',', on_bad_lines='skip')

Impute missing values

# Replace whitespace-padded '?' placeholders with NaN
internetAd.replace({r"\s*\?\s*": np.nan}, regex=True, inplace=True)
# Coerce the feature columns to numeric, then fill NaNs with each column's median
internetAd.iloc[:, :-1] = internetAd.iloc[:, :-1].apply(pd.to_numeric, errors='coerce')
internetAd.iloc[:, :-1] = internetAd.iloc[:, :-1].apply(lambda x: x.fillna(x.median()), axis=0)
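The replace-then-impute steps can be checked on a small toy frame (the column names below are illustrative, not from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw CSV, where '?' (sometimes padded) marks missing values
df = pd.DataFrame({"height": ["10", " ?", "30"],
                   "width":  ["5", "15", "?  "],
                   "label":  ["ad.", "nonad.", "nonad."]})

# Replace any whitespace-padded '?' with NaN, as in the main script
df.replace({r"\s*\?\s*": np.nan}, regex=True, inplace=True)

# Coerce feature columns to numeric and fill NaNs with each column's median
df.iloc[:, :-1] = df.iloc[:, :-1].apply(pd.to_numeric, errors="coerce")
df.iloc[:, :-1] = df.iloc[:, :-1].apply(lambda x: x.fillna(x.median()), axis=0)

print(df["height"].tolist())          # the ' ?' cell becomes the median of 10 and 30
print(int(df.isna().sum().sum()))     # no missing values remain
```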

Split the data into training and test sets

from sklearn.model_selection import train_test_split

X = internetAd.iloc[:, :-1]
y = internetAd.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
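A plain `train_test_split` samples rows uniformly, so if the "ad" class is a small minority it can end up under-represented in the test set; passing `stratify=y` preserves the class proportions in both splits. A minimal sketch with a synthetic imbalanced label vector (the ~15% positive rate here is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 150 + [0] * 850)  # ~15% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo)

# The stratified test split keeps almost exactly the same positive rate
print(round(y_te.mean(), 2))  # 0.15
```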

Model Training

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

log_reg = LogisticRegression(random_state=0)
log_reg.fit(X_train, y_train)

# make predictions with the trained logistic regression model
test_z = log_reg.predict(X_test)
test_z_prob = log_reg.predict_proba(X_test)[:,1]

# evaluate the model performance - accuracy and AUC
print('Logistic Regression Accuracy: ', accuracy_score(y_test, test_z))
print('Logistic Regression AUC: ', roc_auc_score(y_test, test_z_prob))

Logistic Regression Accuracy: 0.958
Logistic Regression AUC: 0.978

Bagging

from sklearn.ensemble import BaggingClassifier

bagOLR = BaggingClassifier(LogisticRegression(random_state=0), max_samples=0.5, max_features=0.5, n_jobs=-1)
bagOLR.fit(X_train, y_train)

# make predictions with the trained bagging ensemble
test_z = bagOLR.predict(X_test)
test_z_prob = bagOLR.predict_proba(X_test)[:,1]

# evaluate the model performance - accuracy and AUC
print('Bagging Classifier Accuracy: ', accuracy_score(y_test, test_z))
print('Bagging Classifier AUC: ', roc_auc_score(y_test, test_z_prob))

Bagging Classifier Accuracy: 0.955
Bagging Classifier AUC: 0.981
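With `max_samples=0.5, max_features=0.5`, each base logistic regression is trained on a random half of the rows and a random half of the columns, and the ensemble probability is simply the average over the fitted members. A small sketch on synthetic data (not the ad dataset) makes this averaging explicit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

X_d, y_d = make_classification(n_samples=300, n_features=20, random_state=0)

bag = BaggingClassifier(LogisticRegression(max_iter=1000),
                        n_estimators=10, max_samples=0.5, max_features=0.5,
                        random_state=0).fit(X_d, y_d)

# Average each member's probabilities, restricted to its own feature subset
manual = np.mean(
    [est.predict_proba(X_d[:, cols])
     for est, cols in zip(bag.estimators_, bag.estimators_features_)],
    axis=0)

# This matches the ensemble's predict_proba
print(np.allclose(manual, bag.predict_proba(X_d)))
```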

AdaBoost

from sklearn.ensemble import AdaBoostClassifier

boostOkLR = AdaBoostClassifier(LogisticRegression(random_state=0))
boostOkLR.fit(X_train, y_train)

# make predictions with the trained AdaBoost ensemble
test_z = boostOkLR.predict(X_test)
test_z_prob = boostOkLR.predict_proba(X_test)[:,1]

# evaluate the model performance - accuracy and AUC
print('AdaBoost Classifier Accuracy: ', accuracy_score(y_test, test_z))
print('AdaBoost Classifier AUC: ', roc_auc_score(y_test, test_z_prob))

AdaBoost Classifier Accuracy: 0.954
AdaBoost Classifier AUC: 0.978
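AdaBoost works by up-weighting the examples the current model gets wrong and refitting, which is why the base estimator must accept `sample_weight` in `fit()` (LogisticRegression does). One round of this reweighting can be sketched by hand on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_d, y_d = make_classification(n_samples=300, n_features=5,
                               flip_y=0.2, random_state=3)

# Round 1: fit with uniform weights and find the misclassified rows
w = np.full(len(y_d), 1 / len(y_d))
clf = LogisticRegression(max_iter=1000).fit(X_d, y_d, sample_weight=w)
miss = clf.predict(X_d) != y_d

# Round 2 (AdaBoost-style): up-weight the misses and refit; the new model
# is pulled toward the examples the first one got wrong
w[miss] *= 3.0
clf2 = LogisticRegression(max_iter=1000).fit(X_d, y_d, sample_weight=w)

print(int(miss.sum()) > 0)                       # round 1 makes some mistakes
print(not np.allclose(clf.coef_, clf2.coef_))    # reweighting changes the fit
```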

Stacking

from sklearn.ensemble import StackingClassifier

estimators = [('lr', LogisticRegression(random_state=0)),
              ('bag', BaggingClassifier(LogisticRegression(random_state=0), max_samples=0.5, max_features=0.5, n_jobs=-1)),
              ('ada', AdaBoostClassifier(LogisticRegression(random_state=0)))]
stk = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stk.fit(X_train, y_train)

# make predictions with the trained model
test_z = stk.predict(X_test)
test_z_prob = stk.predict_proba(X_test)[:,1]

# evaluate the model performance - AUC and ROC
print('Stacking Classifier Accuracy: ', accuracy_score(y_test, test_z))
print('Stacking Classifier AUC: ', roc_auc_score(y_test, test_z_prob))

Stacking Classifier Accuracy: 0.960
Stacking Classifier AUC: 0.983
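Under the hood, `StackingClassifier` fits the final estimator on out-of-fold predictions of the base models (5-fold cross-validation by default), which avoids leaking training labels into the meta-features. A minimal sketch on synthetic data, using a tree as a second base model purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_d, y_d = make_classification(n_samples=300, random_state=4)

stk_demo = StackingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('tree', DecisionTreeClassifier(random_state=4))],
    final_estimator=LogisticRegression(),
    cv=5).fit(X_d, y_d)

# With the default stack_method, each base model contributes its predicted
# probabilities; for binary problems only the positive-class column is kept,
# so the meta-features have one column per base estimator
print(stk_demo.transform(X_d).shape)  # (300, 2)
```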


Results

Model                 Accuracy   AUC
Logistic Regression   0.958      0.978
Bagging               0.955      0.981
AdaBoost              0.954      0.978
Stacking              0.960      0.983

All models performed well, with accuracies above 95% and AUC values close to 1, suggesting that the features are highly indicative of the target class and the data is relatively easy to classify. The Stacking Classifier performed best on both accuracy and AUC, indicating that combining the strengths of several base models can yield a modest additional gain.