Homework 3. Random Forests: Internet Advertisements

Introduction

For this assignment, we're diving into the Internet Advertisements dataset from the UCI Machine Learning Repository. The dataset contains 3,279 images collected from web pages, each described by 1,558 features: three continuous geometry measurements (height, width, and aspect ratio) plus binary indicators derived from the image's URL, alt text, anchor text, and surrounding words. Each sample also carries a binary class label indicating whether the image is an advertisement ("ad") or not ("nonad"). Our goal is to build a random forest classifier capable of predicting which images are ads.


Data Preprocessing

Load the Data

We begin by loading our dataset. Here's how:

import numpy as np
import pandas as pd

# error_bad_lines was removed in recent pandas versions; on_bad_lines='skip' is the equivalent.
internetAd = pd.read_csv('Internet_Ad_Data.csv', sep=',', on_bad_lines='skip')

Handle Missing Values

Real-world data isn't perfect. In this dataset, missing values show up as "?" strings, so we convert them to NaN and then fill them with the median of their respective columns.

# Missing entries are encoded as '?' (sometimes padded with whitespace); map them to NaN.
internetAd.replace({r"\s*\?\s*": np.nan}, regex=True, inplace=True)

# Coerce the feature columns to numeric, then fill NaNs with each column's median.
internetAd.iloc[:, :-1] = internetAd.iloc[:, :-1].apply(pd.to_numeric, errors='coerce')
internetAd.iloc[:, :-1] = internetAd.iloc[:, :-1].apply(lambda x: x.fillna(x.median()), axis=0)
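
If you prefer to keep the imputation step inside scikit-learn, SimpleImputer performs the same median fill; a minimal alternative sketch, assuming the '?' entries have already been converted to NaN and the feature columns are numeric as above:

from sklearn.impute import SimpleImputer

# Median-impute the feature columns; the last column holds the class label.
imputer = SimpleImputer(strategy="median")
internetAd.iloc[:, :-1] = imputer.fit_transform(internetAd.iloc[:, :-1])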

Splitting our Data

We'll split our data into training and test sets. The training set helps us build our model, and the test set lets us evaluate its performance.

from sklearn.model_selection import train_test_split

X = internetAd.iloc[:, :-1]
y = internetAd.iloc[:, -1] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Random Forest Classifier

A random forest is an ensemble learning method for classification. It works by constructing many decision trees at training time and aggregating their predictions. We pair it with grid search to find good hyperparameters for the model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameters = {
    "max_depth": [2, 4],
    "min_samples_split": [0.05, 0.1, 0.2]
}

dtc_grid = GridSearchCV(RandomForestClassifier(), parameters)
dtc_grid.fit(X_train, y_train)

Using grid search with a random forest (or any other model) helps fine-tune its hyperparameters for the best performance. Let's look at why in a bit more detail.

Why use Grid Search with Random Forest?

Random Forest, even though it's relatively robust to overfitting, has several hyperparameters that can be adjusted to optimize its performance. The two tuned here are max_depth and min_samples_split. Here's a brief on these:

  1. max_depth: This parameter determines the maximum depth of each tree. A deep tree can capture more information about the data, but it's also more likely to overfit, especially with a small dataset. By controlling the depth, you can manage the trade-off between bias and variance.

  2. min_samples_split: This parameter controls the minimum number of samples required to split a node. A smaller value allows the trees to make more fine-grained splits, which again captures more information but also risks overfitting. When given as a float (as in the grid above), it is interpreted as a fraction of the training set, so 0.05 means a node must contain at least 5% of the training samples to be split.

By using Grid Search over these parameters, you can systematically explore combinations of max_depth and min_samples_split to find the combination that provides the best cross-validated performance on your training data.
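
Once the grid has been fit, GridSearchCV exposes the winning combination and the full cross-validation table, which is handy for checking how sensitive the score is to each setting. A minimal sketch, assuming dtc_grid has been fit as above:

import pandas as pd

print(dtc_grid.best_params_)   # best combination found by the grid search
print(dtc_grid.best_score_)    # its mean cross-validated score

# Full grid of results, one row per hyperparameter combination.
cv_results = pd.DataFrame(dtc_grid.cv_results_)
print(cv_results[["param_max_depth", "param_min_samples_split", "mean_test_score"]])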

Default Tree used by RandomForest:

RandomForestClassifier in scikit-learn uses Decision Trees as its base learner. The decision trees are fully grown unless you set constraints like max_depth or min_samples_split.

The default settings for the trees in RandomForestClassifier are:

  - max_depth=None (nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples)
  - min_samples_split=2 (the minimum number of samples required to split an internal node)
  - min_samples_leaf=1 (the minimum number of samples required to be at a leaf node)

These defaults mean that, in the absence of other constraints, the trees in the random forest will grow as deep as they can for each bootstrap sample. This might sound like it would lead to overfitting, but because Random Forest aggregates predictions from multiple trees (and each tree sees only a bootstrap sample of the data and a subset of features), the model remains robust and often doesn't overfit easily. However, for some datasets or problems, tuning the parameters can yield better results.
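
To see that "fully grown" behaviour concretely, you can fit a forest with default settings and inspect the depth of its individual trees; a minimal sketch (not part of the graded pipeline, and assuming X_train/y_train from the split above):

from sklearn.ensemble import RandomForestClassifier

# With the defaults, each tree grows until its leaves are pure or too small to split.
default_rf = RandomForestClassifier(random_state=42)
default_rf.fit(X_train, y_train)

depths = [tree.get_depth() for tree in default_rf.estimators_]
print(f"Individual tree depths range from {min(depths)} to {max(depths)}")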

Predictions and Performance Metrics

After training, it's prediction time! We'll also evaluate how well our model performs.

from sklearn.metrics import accuracy_score, roc_auc_score

test_z = dtc_grid.predict(X_test)
test_z_prob = dtc_grid.predict_proba(X_test)[:, 1]  # probability of the greater class label, as expected by roc_auc_score

accuracy = accuracy_score(y_test, test_z)
roc_auc = roc_auc_score(y_test, test_z_prob)

print(f"Accuracy = {accuracy}")
print(f"ROC AUC = {roc_auc}")

The random forest classifier achieved an accuracy of 0.905 and a ROC AUC of 0.954.


ExtraTrees Classifier

ExtraTrees (Extremely Randomized Trees) is a variant of the random forest. Like a random forest, each split considers a random subset of the features, but the split threshold for each candidate feature is drawn at random instead of being optimized, and by default each tree is trained on the full training set rather than a bootstrap sample. The extra randomness further reduces variance, at the cost of a small increase in bias.
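
To make the comparison concrete, here's a rough side-by-side sketch of the two classifiers with default settings, for illustration only (the tuned models below are what we actually report):

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score

for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(random_state=42).fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]
    print(f"{Model.__name__}: test ROC AUC = {roc_auc_score(y_test, proba):.3f}")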

Let's tune the parameters for the extra trees classifier.

from sklearn.ensemble import ExtraTreesClassifier

parameters = {
    "max_depth": [2, 4],
    "min_samples_split": [0.05, 0.1, 0.2]
}

etc_grid = GridSearchCV(ExtraTreesClassifier(), parameters)
etc_grid.fit(X_train, y_train)

Make Predictions and Evaluate

Using our trained classifier, we predict and evaluate its performance.

test_z = etc_grid.predict(X_test)
test_z_prob = etc_grid.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, test_z)
roc_auc = roc_auc_score(y_test, test_z_prob)

print(f"Accuracy = {accuracy}")
print(f"ROC AUC = {roc_auc}")

Results show an accuracy of 0.887 and a ROC AUC of 0.929.


Gradient Boosting Classifier

Gradient boosting is a machine learning technique that produces a prediction model in the form of an ensemble of weak learners, typically shallow decision trees. The model is built in a stage-wise fashion: each new learner is fit to the errors (the negative gradient of the loss) of the ensemble built so far, and its predictions are added to the ensemble, scaled by a learning rate. The process repeats until a fixed number of boosting iterations is reached or the loss stops improving.
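
Because the model is built one stage at a time, scikit-learn exposes the intermediate ensembles through staged_predict_proba; a small sketch (illustrative only, assuming the train/test split from above) that tracks how the test ROC AUC evolves as trees are added:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

gbc = GradientBoostingClassifier(max_depth=2, random_state=42)
gbc.fit(X_train, y_train)

# staged_predict_proba yields predictions after 1, 2, ..., n_estimators boosting stages.
for i, proba in enumerate(gbc.staged_predict_proba(X_test), start=1):
    if i % 25 == 0:
        print(f"stage {i}: test ROC AUC = {roc_auc_score(y_test, proba[:, 1]):.3f}")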

Parameter Fine-Tuning

Optimizing our gradient boosting model using grid search.

from sklearn.ensemble import GradientBoostingClassifier

parameters = {
    "max_depth": [2, 4],
    "min_samples_split": [0.05, 0.1, 0.2]
}

gbc_grid = GridSearchCV(GradientBoostingClassifier(), parameters)
gbc_grid.fit(X_train, y_train)

Make Predictions and Assess

Let's see how our gradient boosting model fares.

test_z = gbc_grid.predict(X_test)
test_z_prob = gbc_grid.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, test_z)
roc_auc = roc_auc_score(y_test, test_z_prob)

print(f"Accuracy = {accuracy}")
print(f"ROC AUC = {roc_auc}")

The accuracy stands at a remarkable 0.963, with a ROC AUC of 0.965.


Results Summary

Wrapping up, here's a quick look at our models and their performance:

Model               Accuracy   ROC AUC
Random Forest       0.905      0.954
Extra Trees         0.887      0.929
Gradient Boosting   0.963      0.965
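
The numbers above were collected one model at a time; a compact sketch that regenerates the same table from the fitted grids (assuming dtc_grid, etc_grid, and gbc_grid from the sections above):

from sklearn.metrics import accuracy_score, roc_auc_score

models = {"Random Forest": dtc_grid, "Extra Trees": etc_grid, "Gradient Boosting": gbc_grid}

print(f"{'Model':<20}{'Accuracy':>10}{'ROC AUC':>10}")
for name, grid in models.items():
    pred = grid.predict(X_test)
    prob = grid.predict_proba(X_test)[:, 1]
    print(f"{name:<20}{accuracy_score(y_test, pred):>10.3f}{roc_auc_score(y_test, prob):>10.3f}")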

The gradient boosting classifier takes the crown with the highest accuracy and ROC AUC, making it the top pick for this dataset.