Homework 1: K-Nearest Neighbors (KNN): Social Network Ads
Introduction
In this homework, we will explore the Social Network Ads dataset. This dataset provides information on users within a social networking platform, specifically whether they made a purchase in response to an advertisement. The K-Nearest Neighbors (KNN) algorithm will be our tool of choice to create a predictive model. We aim to determine the likelihood of a user purchasing a product, based on two features: age and estimated salary.
KNN is particularly suitable for this task because it predicts the classification of a data point based on how its neighbors are classified. In the context of our dataset, this means determining if a user is likely to make a purchase by considering the behavior (purchase/no purchase) of users with similar age and salary profiles.
Encoding Categories
Before delving into the modeling, we need to ensure that our data is in the right format. As machine learning algorithms require numerical input, we'll convert gender categories from text ('Male' and 'Female') to numeric values (0 and 1, respectively).
import pandas as pd

# Load the dataset
SNA = pd.read_csv('./Social_Network_Ads.csv')

# Encode gender numerically: 'Male' -> 0, 'Female' -> 1
SNA_edited = SNA.replace(to_replace='Male', value=0)
SNA_edited = SNA_edited.replace(to_replace='Female', value=1)
It's important to check for class imbalance at this stage, as heavily imbalanced classes can skew our model's predictions. For instance, if one class heavily outnumbers the other, our model might become biased towards predicting the majority class.
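A minimal sketch of that check, assuming the label column is named 'Purchased' as in the standard version of this dataset:

# Count how many users did and did not purchase (0 = no purchase, 1 = purchase)
print(SNA_edited['Purchased'].value_counts())

In our data the classes are unbalanced, with non-purchases outnumbering purchases roughly two to one, which is worth keeping in mind when reading accuracy figures later.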
Exploratory Data Analysis
Here we use the seaborn library to create the following plots (a sketch of the plotting code follows the list):
- Scatterplot of Age vs. Salary, with purchase outcome indicated by color. This shows that users who purchased the product tend to be older and have higher salaries.
- Scatterplot of Gender vs. Salary, with purchase outcome indicated by color. This shows the lack of relationship between gender and purchase outcome.
- Histogram of Age, with purchase outcome indicated by color. This shows that users who purchased the product tend to be older.
- Histogram of Salary, with purchase outcome indicated by color. This shows that users who purchased the product tend to have higher salaries.
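A minimal sketch of the plotting code, assuming the standard column names Gender, Age, EstimatedSalary, and Purchased (adjust to your CSV if they differ):

import seaborn as sns
import matplotlib.pyplot as plt

# Scatterplots colored by purchase outcome
sns.scatterplot(data=SNA_edited, x='Age', y='EstimatedSalary', hue='Purchased')
plt.show()
sns.scatterplot(data=SNA_edited, x='Gender', y='EstimatedSalary', hue='Purchased')
plt.show()

# Histograms colored by purchase outcome
sns.histplot(data=SNA_edited, x='Age', hue='Purchased', multiple='stack')
plt.show()
sns.histplot(data=SNA_edited, x='EstimatedSalary', hue='Purchased', multiple='stack')
plt.show()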
Train-Test Split
To evaluate our model's performance effectively, we need to set aside a portion of our data. By doing this, we can train our model on one subset and test its performance on another, unseen subset. Here, we're allocating 70% of our data for training and reserving 30% for testing.
from sklearn.model_selection import train_test_split
# Split the dataset into attributes (all columns but the last) and labels (the last column, Purchased)
X = SNA_edited.iloc[:, :-1].values
y = SNA_edited.iloc[:, -1].values
# Hold out 30% of the data for testing (pass random_state for a reproducible split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
K-Nearest Neighbors
KNN predicts the class of a given data point based on the classes of its nearest neighbors.
from sklearn.neighbors import KNeighborsClassifier
# Instantiate the learning model with k set to 2
clf = KNeighborsClassifier(n_neighbors=2)
# Train the model using the training sets
clf.fit(X_train, y_train)
# Predict the responses for the test set
y_pred = clf.predict(X_test)
In the above code, we've introduced the typical semantics used by the scikit-learn library. A model is usually represented as a class: the .fit() method trains the model, while the .predict() method makes predictions with the trained model.
Evaluating the Model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
Initial Results
Confusion Matrix:
[[73  8]
 [15 24]]
(By scikit-learn's convention, rows are true labels and columns are predicted labels, so this reads [[TN, FP], [FN, TP]].)
- True Positive (TP) = 24: 24 users were correctly predicted to purchase the product.
- True Negative (TN) = 73: 73 users were correctly predicted not to purchase the product.
- False Positive (FP) = 8: 8 users were wrongly predicted to purchase the product.
- False Negative (FN) = 15: 15 users were wrongly predicted not to purchase the product.
Metrics:
- Accuracy (0.8083 or 80.83%): This means the model correctly predicted the outcome for about 81% of the test set. For every 100 predictions made, approximately 81 were correct.
- Precision (for Class 1): 0.75 - When the model predicted a user would buy the product, it was correct about 75% of the time.
- Recall (for Class 1): 0.62 - Out of the users who actually bought the product, the model captured 62% of them.
- F1-Score (for Class 1): 0.68 - This is the harmonic mean of precision and recall; a value close to 1 means both are high. At 0.68, our model strikes a decent balance for predicting purchases.
- Support: There were 81 instances of class 0 (no purchase) and 39 instances of class 1 (purchase) in the test set.
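As a sanity check, all of these figures follow directly from the confusion matrix; a minimal sketch:

# Recompute the headline metrics by hand from the confusion matrix entries
TN, FP, FN, TP = 73, 8, 15, 24

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 97 / 120 ≈ 0.8083
precision = TP / (TP + FP)                          # 24 / 32 = 0.75
recall = TP / (TP + FN)                             # 24 / 39 ≈ 0.6154
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.68

print(accuracy, precision, recall, f1)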
Observations and Conclusion:
- Model Performance: The model performs reasonably well with an accuracy of 80.83%. This means it can be a good starting point, but improvements can certainly be made.
- Class Imbalance: There's a noticeable imbalance in the test set, with more non-purchases (81) than purchases (39). This means accuracy alone can be misleading, and the minority-class precision and recall are the more informative metrics.
- Precision vs. Recall: The model has a higher precision than recall for predicting purchases. This means while the model is quite reliable when it predicts a user will make a purchase, it misses a good chunk (38%) of actual purchasers.
Using other values of k and distance metrics
To improve the model, we tried different values of k in KNN to see if performance improves. Recall that k is the number of neighbors used to predict the class of a data point. We'll use the same code as before, but with k set to 2, 3, 4, 5, and 6.
For the distance metric, we used the Euclidean (the default), Minkowski, and Chebyshev distances. The Minkowski distance is a generalization of the Euclidean and Manhattan distances, and the Chebyshev distance is the maximum absolute difference between feature values.
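A sketch of the sweep, reusing the split from above (metric names follow scikit-learn's conventions; 'minkowski' with the default p=2 is equivalent to 'euclidean'):

# Try each k / distance-metric combination and report test accuracy
for metric in ['euclidean', 'minkowski', 'chebyshev']:
    for k in [2, 3, 4, 5, 6]:
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
        clf.fit(X_train, y_train)
        print(f'metric={metric}, k={k}: accuracy={clf.score(X_test, y_test):.3f}')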
The results are summarized below:
1. K-Value:
- k=2: The accuracy ranged from 0.775 (with Chebyshev) to 0.808 (with both Minkowski and the default metric).
- k=3: The accuracy ranged from 0.816 (with Chebyshev) to 0.825 (with both Minkowski and the default metric).
- k=4: The accuracy ranged from 0.825 (with both Chebyshev and Minkowski) to 0.858 (with the default metric).
- k=5: The accuracy ranged from 0.8 (with Chebyshev) to 0.866 (with both Minkowski and the default metric).
From this, k=5 achieves the highest overall accuracy (0.866, with the default/Minkowski metric), although with the Chebyshev metric k=4 performs best (0.825).
2. Distance Metric:
- Default (Euclidean): The accuracy ranged from 0.808 (k=2) to 0.866 (k=5).
- Minkowski: The accuracy results matched the default metric (the default is in fact Minkowski with p=2, which is exactly the Euclidean distance).
- Chebyshev: The accuracy ranged from 0.775 (k=2) to 0.825 (k=4).
From this, the default (Euclidean) metric outperforms the Chebyshev distance at every k, most clearly at k=5.
Conclusion:
- The best k-value is k=5, with an accuracy of 0.866.
- The default (Euclidean) distance metric is slightly superior to the Chebyshev distance for this dataset.
Why?
- K-value: As k increases, the decision boundary becomes smoother, which can help avoid overfitting to noise in the training data. However, if k is too large, the model may underfit. In this case, k=5 seems to provide a good balance.
- Distance Metric: The choice of distance metric changes how neighbors are prioritized. Euclidean distance (the default in sklearn's KNN) is often effective for many datasets. The Chebyshev distance, which uses only the maximum absolute difference between features, may not capture the relationships in this dataset as effectively as the Euclidean distance. Ultimately, the choice of metric should be based on domain knowledge and the nature of the data.
This is just a simple assignment, but there are many other ways to improve the model. For instance, we could try feature scaling (particularly important for KNN, since EstimatedSalary is on a much larger numeric scale than Age and will otherwise dominate the distance computation), feature selection, and hyperparameter tuning. The choice of distance metric and k-value should also be validated with techniques like cross-validation for a comprehensive assessment.
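As a closing sketch, here is one hedged way those ideas could be combined with a scaling pipeline and grid search (the parameter grid is illustrative, not tuned):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Standardize features so Age and EstimatedSalary contribute comparably to distances
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Cross-validate k and the distance metric instead of relying on a single train/test split
param_grid = {
    'knn__n_neighbors': [2, 3, 4, 5, 6],
    'knn__metric': ['euclidean', 'chebyshev'],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)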