Homework 6. Clustering: Superstore Transactions

Introduction

This assignment uses the Superstore Transactions dataset from Kaggle. The dataset contains 4922 rows of transaction data from an online superstore, spanning 4 years. The objective is to cluster the transactions into groups based on purchasing behavior, including total sales. We implement our own K-Means clustering algorithm and compare it with the scikit-learn implementation.


Helper Functions (K-Means Clustering)

Label the Data

import numpy as np
import pandas as pd

def find_label_of_closest(points: pd.DataFrame, cluster_centroids: pd.DataFrame) -> np.ndarray:
    """
    Assign each point to the nearest cluster centroid.

    Parameters:
    - points (pd.DataFrame): Data points with dimensions [Number of Points, Number of Dimensions].
    - cluster_centroids (pd.DataFrame): Cluster centroids with dimensions [Number of Clusters, Number of Dimensions].

    Returns:
    - np.ndarray: An array containing the positional index (label) of the closest cluster centroid for each data point.
    """

    centroid_values = np.asarray(cluster_centroids, dtype=float)
    labels = np.zeros(len(points), dtype=int)

    # Enumerate rows by position: iterrows() yields index labels, which would
    # write to the wrong slot whenever the DataFrame index is not 0..n-1.
    for position, point in enumerate(np.asarray(points, dtype=float)):
        distances = np.linalg.norm(centroid_values - point, axis=1)
        labels[position] = np.argmin(distances)

    return labels
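
The nearest-centroid assignment can also be written without a Python loop by using NumPy broadcasting. The block below is a minimal sketch under the assumption that both inputs are plain float arrays; the name nearest_centroid_labels and the toy data are hypothetical and not part of the assignment code.

```python
import numpy as np

def nearest_centroid_labels(points: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Hypothetical helper: index of the nearest centroid for each row of `points`."""
    # Broadcasting yields an array of shape (n_points, n_clusters, n_dims).
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # Euclidean distance from every point to every centroid.
    distances = np.linalg.norm(diffs, axis=2)
    # Position of the smallest distance in each row is the cluster label.
    return np.argmin(distances, axis=1)

toy_points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
toy_centroids = np.array([[0.5, 0.5], [9.5, 9.5]])
toy_labels = nearest_centroid_labels(toy_points, toy_centroids)
print(toy_labels)  # [0 0 1 1]
```

The broadcast version allocates an intermediate array of size n_points x n_clusters x n_dims, so it trades memory for speed; for a few thousand rows and three clusters that cost is small.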

Calculate Centroids

def calculate_cluster_centroid(points: pd.DataFrame, labels: np.ndarray) -> pd.DataFrame:
    """
    Calculate the centroid for each cluster based on the given points and their corresponding labels.

    Parameters:
    - points (pd.DataFrame): Data points with dimensions [Number of Points, Number of Dimensions].
    - labels (np.ndarray): An array containing the label of the cluster to which each point belongs.

    Returns:
    - pd.DataFrame: A dataframe containing the centroids for each cluster.
    """

    points = pd.DataFrame(points)
    unique_labels = np.unique(labels)
    # Reuse the columns of `points`: assigning a Series to a row aligns on column
    # labels, so integer placeholders would produce NaN centroids for named columns.
    cluster_centroids = pd.DataFrame(np.nan, index=unique_labels, columns=points.columns)

    for label in unique_labels:
        # The boolean mask selects the rows that belong to this cluster.
        cluster_centroids.loc[label] = points[labels == label].mean(axis=0)

    return cluster_centroids
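
The same per-cluster mean can be obtained in one step with a pandas groupby. This is an illustrative sketch on made-up data (toy_points and toy_labels are hypothetical), not a replacement for the function above.

```python
import numpy as np
import pandas as pd

toy_points = pd.DataFrame({"x": [0.0, 2.0, 10.0, 12.0],
                           "y": [0.0, 2.0, 10.0, 14.0]})
toy_labels = np.array([0, 0, 1, 1])

# Grouping by an array of labels (same length as the frame) averages each
# column within each cluster; the result is indexed by cluster label.
toy_centroids = toy_points.groupby(toy_labels).mean()
print(toy_centroids)
```

Cluster 0 averages to (1.0, 1.0) and cluster 1 to (11.0, 12.0). As with the loop version, a label that has no points simply does not appear in the result.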

Put it together

import numpy as np
import pandas as pd

def k_means(points: pd.DataFrame, initial_centroids: pd.DataFrame) -> tuple[np.ndarray, pd.DataFrame]:
    """
    Perform k-means clustering algorithm on a set of data points.

    Parameters:
    - points (pd.DataFrame): The set of data points to cluster.
    - initial_centroids (pd.DataFrame): Initial guess for the cluster centroids.

    Returns:
    - tuple(np.ndarray, pd.DataFrame): A tuple containing two elements:
        1. An array of labels indicating the cluster to which each data point belongs.
        2. A dataframe containing the final cluster centroids.
    """

    centroids = initial_centroids.copy()
    previous_labels = None

    # Get the starting set of labels
    labels = find_label_of_closest(points, centroids)

    while not np.array_equal(labels, previous_labels):
        centroids = calculate_cluster_centroid(points, labels)
        previous_labels = labels.copy()
        labels = find_label_of_closest(points, centroids)

    return labels, centroids
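
To see the assign/update loop converge, it helps to trace it on a tiny input. The sketch below is a NumPy-only variant written for illustration: the function name lloyd_iterations and the max_iter safeguard are assumptions added here, and the code assumes that no cluster ever becomes empty.

```python
import numpy as np

def lloyd_iterations(points, centroids, max_iter=100):
    """Hypothetical sketch of the k-means loop; `max_iter` is an added safeguard."""
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its points
        # (assumes every cluster keeps at least one point).
        centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
start = np.array([[0.0, 0.0], [1.0, 1.0]])
labels, centroids = lloyd_iterations(toy_points, start)
print(labels)     # [0 0 1 1]
print(centroids)  # rows: [0, 0.5] and [10, 10.5]
```

Here the first pass already separates the two groups, and the second pass confirms that the labels are unchanged, which ends the loop.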

Data Preprocessing

Loading the Data

StoreTxn = pd.read_csv("./Superstore Transaction data.csv")

Convert to datetime

StoreTxn['Order Date'] = pd.to_datetime(StoreTxn['Order Date'])

Extracting Features

This code segment aims to derive insights from transaction data for better customer understanding and segmentation. It focuses on the RFM (Recency, Frequency, Monetary) metrics. By understanding how recently a customer shopped, how often they shop, and how much they spend, businesses can customize their offerings and interactions to better cater to individual customer needs, potentially improving loyalty and increasing sales.

# Customer segmentation on three behavioral dimensions: recency, frequency, and monetary value (RFM).

# First, we're aggregating the transaction data to understand the total sales and quantity for each customer on any given day.
txn_agg = StoreTxn.groupby(['Customer ID', 'Order Date'])[['Sales', 'Quantity']].sum().reset_index()

# Next, we extract recency, frequency, and monetary metrics. Recency indicates how recently a customer has made a purchase, frequency reflects how often they buy, and monetary value shows how much they spend.
# Using a 7-day moving window for frequency and monetary ensures we capture the most recent transaction patterns, which can be more indicative of a customer's current behavior.

# Computing recency:
# Here, we're finding out the number of days since the last visit for each customer.
last = txn_agg.copy()
last['last_visit_ndays'] = last.groupby('Customer ID')['Order Date'].diff().dt.days
# We drop the 'Sales' and 'Quantity' columns as they're not needed for the recency calculation.
last.drop(['Sales', 'Quantity'], axis=1, inplace=True)
print(last.head(10), end='\n\n')

# Computing frequency and monetary value:
# We use a rolling window of 7 days to calculate the total sales and quantity for each customer.
roll = txn_agg.copy()
roll.set_index('Order Date', inplace=True)
roll = roll.groupby('Customer ID').rolling('7D')[['Quantity', 'Sales']].sum().reset_index()
roll.rename(columns={'Quantity': 'quantity_roll_sum_7D', 'Sales': 'sales_roll_sum_7D'}, inplace=True)
print(roll.head(10), end='\n\n')

In the provided code, the Recency, Frequency, and Monetary (RFM) metrics for customers are derived as follows:

  1. Recency (last_visit_ndays): This metric indicates the number of days since the previous visit (or purchase) for each customer. It is calculated using the .diff() method on the Order Date, which computes the difference between the current and previous transaction date for each customer. A customer's first order has no previous date, so its value is missing.

  2. Frequency (quantity_roll_sum_7D): This metric represents the total number of items (quantity) bought by a customer within a rolling window of 7 days. It gives a short-term frequency measure for each customer.

  3. Monetary (sales_roll_sum_7D): This metric captures the total sales (or spending) by a customer within a rolling window of 7 days. It gives a short-term monetary measure for each customer.

These metrics can be found in the derived dataframes last (for Recency) and roll (for Frequency and Monetary).
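
A small worked example makes the two calculations concrete. The data below is invented for illustration (customers "A" and "B" and their sales figures are hypothetical); only the monetary column is shown, since the quantity column is computed the same way.

```python
import pandas as pd

toy = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "B"],
    "Order Date": pd.to_datetime(
        ["2020-01-01", "2020-01-05", "2020-01-20", "2020-01-25"]),
    "Sales": [10.0, 20.0, 30.0, 5.0],
})

# Recency: days since the same customer's previous order (NaN for a first order).
toy["last_visit_ndays"] = toy.groupby("Customer ID")["Order Date"].diff().dt.days

# Monetary: sales summed over the 7 days ending at each order, per customer.
rolled = (
    toy.set_index("Order Date")
       .groupby("Customer ID")["Sales"]
       .rolling("7D")
       .sum()
       .reset_index(name="sales_roll_sum_7D")
)

print(toy[["Customer ID", "Order Date", "last_visit_ndays"]])
print(rolled)
```

Customer A's second order falls 4 days after the first, so both orders land in the same 7-day window and the rolling sum is 30.0. The third order comes 15 days later, so its window contains only itself and the sum is again 30.0.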

We can combine these metrics into a single dataframe for further analysis.

# Inner merge of roll (frequency and monetary fields) with last (recency field) to create txn_roll.
# Merging on the shared key columns avoids the column-overlap error that an index-based join
# raises when both dataframes contain 'Customer ID' and 'Order Date'.
txn_roll = roll.merge(last, on=['Customer ID', 'Order Date'], how='inner')

print(txn_roll.dtypes, end='\n\n')
txn_roll.head(10)

Replace missing values

# last_visit_ndays is already a number of days (float), so fill with a number rather than a Timedelta.
txn_roll['last_visit_ndays'] = txn_roll['last_visit_ndays'].fillna(1000) # Replace missing recency values with 1000 days

Merge with Daily Totals per Customer

txn_rfm = txn_agg.merge(txn_roll, on=['Customer ID', 'Order Date'])

K-Means Clustering

Our own implementation

import matplotlib.pyplot as plt

# The merge above introduces no overlapping columns, so the feature names carry no '_x' suffix.
feature_cols = ['last_visit_ndays', 'quantity_roll_sum_7D', 'sales_roll_sum_7D']
txn_rfm[feature_cols] = txn_rfm[feature_cols].apply(pd.to_numeric, errors='coerce')
txn_rfm = txn_rfm.fillna(0)

# k_means expects DataFrames; integer column labels and a fresh 0..n-1 index keep label alignment simple.
Points = pd.DataFrame(txn_rfm[feature_cols].to_numpy())
ClusterCentroidGuesses = Points.iloc[np.random.choice(len(Points), size=3, replace=False)].reset_index(drop=True)

# Call our own k_means function (scikit-learn's KMeans class is only imported later).
Labels, ClusterCentroids = k_means(Points, ClusterCentroidGuesses)

plt.figure(figsize=(10, 8))
for cluster, color in enumerate(['red', 'blue', 'green']):
    in_cluster = Labels == cluster
    # Column 0 is recency and column 1 is the 7-day quantity sum.
    plt.scatter(Points.loc[in_cluster, 0], Points.loc[in_cluster, 1], s=50, c=color, label=f'Cluster {cluster + 1}')
plt.title('Clusters of customers')
plt.xlabel('last_visit_ndays')
plt.ylabel('quantity_roll_sum_7D')
plt.legend()
plt.show()

[Figure: k_mean_diy]

scikit-learn implementation

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
kmeans.fit(Points)
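
When comparing the two implementations, note that cluster numbers are arbitrary: one run may call a group "0" while another calls the same group "2". A permutation-invariant score such as the adjusted Rand index from scikit-learn is one way to compare labelings. The sketch below uses made-up label arrays (labels_a and labels_b are hypothetical) rather than the results of this notebook.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Two labelings that describe the same grouping with different cluster numbers.
labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1])

# A score of 1.0 means the partitions are identical up to renaming;
# values near 0 indicate agreement no better than chance.
score = adjusted_rand_score(labels_a, labels_b)
print(score)  # 1.0
```

Applied to this notebook, the same call could take Labels from our own implementation and kmeans.labels_ from scikit-learn. Because both methods start from different initial centroids, a score below 1.0 would not by itself indicate a bug.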