Homework 4. Model Building Part 1: Internet Advertisements
Introduction
This homework assignment is based on the Internet Advertisements Data Set from the UCI Machine Learning Repository. The dataset contains 3,279 observations of 1,558 features. The goal is to predict whether or not an image is an advertisement based on the features.
Data Preprocessing
Load the data
import pandas as pd
import numpy as np

# on_bad_lines='skip' is the modern replacement (pandas >= 1.3) for the deprecated error_bad_lines=False
internetAd = pd.read_csv('Internet_Ad_Data.csv', sep=',', on_bad_lines='skip')
Replace ? with NaN
internetAd.replace(to_replace=r' *\?', value=np.nan, inplace=True, regex=True)
Binary encoding
# Replace 'nonad.' first; with regex=True the pattern r'ad\.' would otherwise also match inside 'nonad.'
internetAd.replace(to_replace=r'nonad\.', value=0, inplace=True, regex=True)
internetAd.replace(to_replace=r'ad\.', value=1, inplace=True, regex=True)
Cast data types to float
internetAd[["height", "width", "aratio", "local"]] = internetAd[["height", "width", "aratio", "local"]].astype("float")
Replace NaN with median
# Impute each continuous column's missing values with that column's median
for col in ["height", "width", "aratio", "local"]:
    internetAd[col] = internetAd[col].fillna(internetAd[col].median())
Make sure there are no null values left
assert internetAd.isnull().sum().sum() == 0
Split the data into training and test sets
from sklearn.model_selection import train_test_split
X = internetAd.drop('Target',axis=1)
y = internetAd['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
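Since ads are a minority class in this dataset, it is worth checking the class balance of the split. A quick sanity check, assuming the variables from the split above:
# Proportion of non-ads (0) vs. ads (1) in each split
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))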
Normalize the features
The data so far is problematic because the range of values varies significantly from feature to feature. The models we use are sensitive to this, so we need to normalize the data. We can do this with MinMaxScaler from sklearn.preprocessing.
How to use it:
- Instantiate the scaler and fit it to the training data.
- Use the scaler to transform both the training and test data.
Notice that we do not re-fit the scaler to the test data. We only use the scaler that was fit to the training data to transform the test data.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_minmax_scaled = scaler.transform(X_train)
X_test_minmax_scaled = scaler.transform(X_test)
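As a shorthand, fit_transform performs the fit and the training-set transform in a single call:
# Equivalent: fit the scaler and transform the training data in one step
X_train_minmax_scaled = scaler.fit_transform(X_train)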
There are also other options, with a usage sketch after this list:
- sklearn.preprocessing.StandardScaler: Standardize features by removing the mean and scaling to unit variance.
- sklearn.preprocessing.RobustScaler: Scale features using statistics that are robust to outliers.
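Swapping one of these in is a one-line change. A minimal sketch with StandardScaler (the variable names here are placeholders, not part of the assignment):
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
X_train_std_scaled = std_scaler.fit_transform(X_train)  # fit on the training data only
X_test_std_scaled = std_scaler.transform(X_test)        # reuse the fitted scaler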
Logistic Regression (without regularization)
We specified class_weight='balanced' to account for the imbalanced dataset; this automatically assigns class weights inversely proportional to class frequencies.
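To inspect the weights this setting produces, sklearn's compute_class_weight implements the same formula, n_samples / (n_classes * count(c)). A quick sketch, assuming y_train from the split above:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weight for class c: n_samples / (n_classes * count(c))
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
print(dict(zip(np.unique(y_train), weights)))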
from sklearn.linear_model import LogisticRegression

# penalty=None fits a truly unregularized model (scikit-learn >= 1.2; use penalty='none' on older
# versions); without it, LogisticRegression applies an L2 penalty by default
clf = LogisticRegression(random_state=0, max_iter=1000, class_weight='balanced', penalty=None, solver='saga').fit(X_train_minmax_scaled, y_train)
y_pred = clf.predict(X_test_minmax_scaled)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
The results are already pretty good:
precision recall f1-score support
0 0.98 0.98 0.98 916
1 0.87 0.90 0.88 167
accuracy 0.96 1083
macro avg 0.93 0.94 0.93 1083
weighted avg 0.96 0.96 0.96 1083
[[894 22]
[ 17 150]]
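In sklearn's convention the confusion matrix has true classes as rows and predicted classes as columns, so this reads as 894 true negatives, 22 false positives, 17 false negatives, and 150 true positives.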
Using Regularization
To prevent overfitting, we can use regularization. With logistic regression, we just specify penalty='l1', penalty='l2', or penalty='elasticnet' in the LogisticRegression call (elasticnet also requires an l1_ratio, and the saga solver supports all three penalties).
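For instance, a minimal sketch of an L1-penalized fit; the C value (inverse regularization strength) is an arbitrary illustrative choice here and would normally be tuned, e.g. by cross-validation:
# L1-regularized logistic regression; the saga solver supports l1, l2, and elasticnet
clf_l1 = LogisticRegression(random_state=0, max_iter=1000, class_weight='balanced',
                            solver='saga', penalty='l1', C=1.0)
clf_l1.fit(X_train_minmax_scaled, y_train)
print(classification_report(y_test, clf_l1.predict(X_test_minmax_scaled)))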
The results are comparable to those of the unregularized model.