Homework 5. Ensemble Methods: Internet Advertisements
Introduction
This assignment is based on the UCI Internet Advertisements Data Set. The dataset contains 3,279 samples, each describing an image embedded in a web page, with 1,558 features (three continuous, the rest binary). The objective is to classify each sample as an advertisement ("ad.") or not ("nonad.").
Data Preprocessing
Load the Data
import numpy as np
import pandas as pd

# on_bad_lines='skip' replaces the deprecated error_bad_lines=False flag (pandas >= 1.3)
internetAd = pd.read_csv('Internet_Ad_Data.csv', sep=',', on_bad_lines='skip')
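As an optional sanity check, the shape and label counts should match the description in the introduction; the label strings 'ad.' and 'nonad.' are an assumption about how the CSV encodes the class.
# optional sanity check: row/column counts and class balance
print(internetAd.shape)
print(internetAd.iloc[:, -1].value_counts())  # expected labels: 'ad.' / 'nonad.'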
Impute missing values
# missing values appear as '?', sometimes padded with whitespace; map them to NaN
internetAd.replace({r"\s*\?\s*": np.nan}, regex=True, inplace=True)
# coerce the feature columns to numeric, then fill any NaNs with each column's median
internetAd.iloc[:, :-1] = internetAd.iloc[:, :-1].apply(pd.to_numeric, errors='coerce')
internetAd.iloc[:, :-1] = internetAd.iloc[:, :-1].apply(lambda x: x.fillna(x.median()), axis=0)
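A minimal check that the imputation worked, assuming the cells above ran in order: no missing values should remain in the feature columns.
# verify the imputation: every feature column should be numeric with no NaNs left
assert internetAd.iloc[:, :-1].isna().sum().sum() == 0
print(internetAd.iloc[:, :-1].dtypes.value_counts())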
Split the data into training and test sets
from sklearn.model_selection import train_test_split
X = internetAd.iloc[:, :-1]
y = internetAd.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
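The dataset has noticeably more 'nonad.' than 'ad.' samples, so it is worth confirming that the split preserves the class ratio. This is just a check; passing stratify=y to train_test_split would enforce balanced proportions explicitly, at the cost of changing the split behind the results below.
# compare class proportions across the two splits
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))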
Model Training
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# note: with 1,500+ features the default solver may warn about convergence;
# raising max_iter is an option if that happens
log_reg = LogisticRegression(random_state=0)
log_reg.fit(X_train, y_train)
# make predictions with the trained logistic regression model
test_z = log_reg.predict(X_test)
test_z_prob = log_reg.predict_proba(X_test)[:,1]
# evaluate model performance: accuracy and AUC
print('Logistic Regression Accuracy: ', accuracy_score(y_test, test_z))
print('Logistic Regression AUC: ', roc_auc_score(y_test, test_z_prob))
Logistic Regression Accuracy: 0.958
Logistic Regression AUC: 0.978
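The comments above mention ROC as well as AUC, so the sketch below plots the ROC curve from the predicted probabilities. It assumes matplotlib is available; column 1 of predict_proba corresponds to log_reg.classes_[1], which is why that class is passed as pos_label.
# plot the ROC curve for the logistic regression test predictions
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_predictions(y_test, test_z_prob, pos_label=log_reg.classes_[1])
plt.title('Logistic Regression ROC (test set)')
plt.show()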
Bagging
from sklearn.ensemble import BaggingClassifier
# bag 10 logistic regressions (the sklearn default n_estimators), each trained on
# half the rows and half the features; no random_state is set here, so results
# can vary slightly between runs
bagOLR = BaggingClassifier(LogisticRegression(random_state=0), max_samples=0.5, max_features=0.5, n_jobs=-1)
bagOLR.fit(X_train, y_train)
# make predictions with the trained bagging ensemble
test_z = bagOLR.predict(X_test)
test_z_prob = bagOLR.predict_proba(X_test)[:,1]
# evaluate model performance: accuracy and AUC
print('Bagging Classifier Accuracy: ', accuracy_score(y_test, test_z))
print('Bagging Classifier AUC: ', roc_auc_score(y_test, test_z_prob))
Bagging Classifier Accuracy: 0.955
Bagging Classifier AUC: 0.981
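Because bagging bootstraps the training rows, an out-of-bag accuracy estimate comes almost for free and avoids touching the test set. The sketch below refits a copy of the model above with oob_score enabled; the random_state is arbitrary.
# out-of-bag accuracy: each sample is scored by the estimators that never saw it
bag_oob = BaggingClassifier(LogisticRegression(random_state=0),
                            max_samples=0.5, max_features=0.5,
                            oob_score=True, n_jobs=-1, random_state=0)
bag_oob.fit(X_train, y_train)
print('Bagging OOB accuracy: ', bag_oob.oob_score_)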
AdaBoost
from sklearn.ensemble import AdaBoostClassifier
boostOkLR = AdaBoostClassifier(LogisticRegression(random_state=0))
boostOkLR.fit(X_train, y_train)
# make predictions with the trained AdaBoost ensemble
test_z = boostOkLR.predict(X_test)
test_z_prob = boostOkLR.predict_proba(X_test)[:,1]
# evaluate model performance: accuracy and AUC
print('AdaBoost Classifier Accuracy: ', accuracy_score(y_test, test_z))
print('AdaBoost Classifier AUC: ', roc_auc_score(y_test, test_z_prob))
AdaBoost Classifier Accuracy: 0.954
AdaBoost Classifier AUC: 0.978
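AdaBoost fits its base learners sequentially, so it can report test performance after every boosting round. The sketch below uses staged_predict_proba to see where the AUC plateaus (50 rounds is the sklearn default).
# track test AUC every 10 boosting rounds
for i, probs in enumerate(boostOkLR.staged_predict_proba(X_test), start=1):
    if i % 10 == 0:
        print(f'round {i:2d}  AUC: {roc_auc_score(y_test, probs[:, 1]):.3f}')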
Stacking
from sklearn.ensemble import StackingClassifier
estimators = [('lr', LogisticRegression()),
('bag', BaggingClassifier(LogisticRegression(), max_samples=0.5, max_features=0.5, n_jobs=-1)),
('ada', AdaBoostClassifier(LogisticRegression()))]
stk = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stk.fit(X_train, y_train)
# make predictions with the trained stacking ensemble
test_z = stk.predict(X_test)
test_z_prob = stk.predict_proba(X_test)[:,1]
# evaluate model performance: accuracy and AUC
print('Stacking Classifier Accuracy: ', accuracy_score(y_test, test_z))
print('Stacking Classifier AUC: ', roc_auc_score(y_test, test_z_prob))
Stacking Classifier Accuracy: 0.960
Stacking Classifier AUC: 0.983
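One way to see how the stack weighs its base models is to inspect the fitted meta-learner. This assumes the default stack_method, under which each base estimator contributes a single predicted-probability column for a binary problem, so there is one coefficient per base model.
# the meta-learner's coefficients: one per base estimator's probability column
for name, coef in zip([n for n, _ in estimators], stk.final_estimator_.coef_[0]):
    print(f'{name}: {coef:.3f}')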
Results
| Model | Accuracy | AUC |
|---|---|---|
| Logistic Regression | 0.958 | 0.978 |
| Bagging | 0.955 | 0.981 |
| AdaBoost | 0.954 | 0.978 |
| Stacking | 0.960 | 0.983 |
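For reference, the table can be regenerated in one loop over the models fitted above (a small convenience sketch, assuming all four fits succeeded):
# recompute the table from the fitted models
models = {'Logistic Regression': log_reg, 'Bagging': bagOLR,
          'AdaBoost': boostOkLR, 'Stacking': stk}
for name, model in models.items():
    z = model.predict(X_test)
    p = model.predict_proba(X_test)[:, 1]
    print(f'{name:20s} accuracy: {accuracy_score(y_test, z):.3f}  AUC: {roc_auc_score(y_test, p):.3f}')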
All four models performed well, with accuracies above 95% and AUC values close to 1, which suggests the features are highly informative of the target variable. The stacking classifier performed best on both metrics (0.960 accuracy, 0.983 AUC), though its margin over plain logistic regression is small; this supports the idea that combining models with complementary errors can help, even when each base model is already strong.