Homework 4. Model Building Part 1: Internet Advertisements
Introduction
This homework assignment is based on the Internet Advertisements Data Set from the UCI Machine Learning Repository. The dataset contains 3,279 observations of 1,558 features. The goal is to predict whether or not an image is an advertisement based on the features.
Data Preprocessing
Load the data
import pandas as pd
import numpy as np

# on_bad_lines='skip' is the modern replacement (pandas >= 1.3) for the deprecated error_bad_lines=False
internetAd = pd.read_csv('Internet_Ad_Data.csv', sep=',', on_bad_lines='skip')
Replace ? with NaN
internetAd.replace(to_replace=r' *\?', value=np.nan, inplace=True, regex=True)
Binary encoding
# Replace 'nonad.' first; with regex=True the pattern r'ad\.' would otherwise also match inside 'nonad.'
internetAd.replace(to_replace=r'nonad\.', value=0, inplace=True, regex=True)
internetAd.replace(to_replace=r'ad\.', value=1, inplace=True, regex=True)
Cast data types to float
internetAd[["height", "width", "aratio", "local"]] = internetAd[["height", "width", "aratio", "local"]].astype("float")
Replace NaN with median
# Impute each continuous column's missing values with that column's median
for col in ["height", "width", "aratio", "local"]:
    internetAd[col] = internetAd[col].fillna(internetAd[col].median())
Make sure there are no null values left
assert internetAd.isnull().sum().sum() == 0
Split the data into training and test sets
from sklearn.model_selection import train_test_split
X = internetAd.drop('Target',axis=1)
y = internetAd['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
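Since ads are a minority class in this dataset, it is worth checking the class balance of the split. A quick sanity check, assuming the variables from the split above:
# Proportion of non-ads (0) vs. ads (1) in each split
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))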
Normalize the features
The data so far is problematic because the range of values varies significantly from feature to feature. The models we use are sensitive to this, so we need to normalize the data. We can do this with MinMaxScaler from sklearn.preprocessing.
How to use it:
- Instantiate the scaler and fit it to the training data.
- Use the scaler to transform both the training and test data.
Notice that we do not re-fit the scaler to the test data. We only use the scaler that was fit to the training data to transform the test data.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_minmax_scaled = scaler.transform(X_train)
X_test_minmax_scaled = scaler.transform(X_test)
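As a shorthand, fit_transform performs the fit and the training-set transform in a single call:
# Equivalent: fit the scaler and transform the training data in one step
X_train_minmax_scaled = scaler.fit_transform(X_train)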
There are also other options, with a usage sketch after this list:
- sklearn.preprocessing.StandardScaler: Standardize features by removing the mean and scaling to unit variance.
- sklearn.preprocessing.RobustScaler: Scale features using statistics that are robust to outliers.
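Swapping one of these in is a one-line change. A minimal sketch with StandardScaler (the variable names here are placeholders, not part of the assignment):
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
X_train_std_scaled = std_scaler.fit_transform(X_train)  # fit on the training data only
X_test_std_scaled = std_scaler.transform(X_test)        # reuse the fitted scaler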
Logistic Regression (without regularization)
We specified class_weight='balanced' to account for the imbalanced dataset; this automatically assigns class weights inversely proportional to class frequencies.
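To inspect the weights this setting produces, sklearn's compute_class_weight implements the same formula, n_samples / (n_classes * count(c)). A quick sketch, assuming y_train from the split above:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weight for class c: n_samples / (n_classes * count(c))
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
print(dict(zip(np.unique(y_train), weights)))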
from sklearn.linear_model import LogisticRegression

# penalty=None fits a truly unregularized model (scikit-learn >= 1.2; use penalty='none' on older
# versions); without it, LogisticRegression applies an L2 penalty by default
clf = LogisticRegression(random_state=0, max_iter=1000, class_weight='balanced', penalty=None, solver='saga').fit(X_train_minmax_scaled, y_train)
y_pred = clf.predict(X_test_minmax_scaled)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
The results are already pretty good:
precision recall f1-score support
0 0.98 0.98 0.98 916
1 0.87 0.90 0.88 167
accuracy 0.96 1083
macro avg 0.93 0.94 0.93 1083
weighted avg 0.96 0.96 0.96 1083
[[894 22]
[ 17 150]]
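In sklearn's convention the confusion matrix has true classes as rows and predicted classes as columns, so this reads as 894 true negatives, 22 false positives, 17 false negatives, and 150 true positives.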
Using Regularization
To prevent overfitting, we can use regularization. With logistic regression, we just specify penalty='l1', penalty='l2', or penalty='elasticnet' in the LogisticRegression call (elasticnet also requires an l1_ratio, and the saga solver supports all three penalties).
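For instance, a minimal sketch of an L1-penalized fit; the C value (inverse regularization strength) is an arbitrary illustrative choice here and would normally be tuned, e.g. by cross-validation:
# L1-regularized logistic regression; the saga solver supports l1, l2, and elasticnet
clf_l1 = LogisticRegression(random_state=0, max_iter=1000, class_weight='balanced',
                            solver='saga', penalty='l1', C=1.0)
clf_l1.fit(X_train_minmax_scaled, y_train)
print(classification_report(y_test, clf_l1.predict(X_test_minmax_scaled)))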
The results are comparable to those of the unregularized model.