Homework 2. Decision Trees: Banknote Authentication

Introduction

This assignment is based on the Banknote Authentication dataset from the UCI Machine Learning Repository. The dataset contains 1,372 samples of genuine and forged banknotes; each sample has four continuous features and a binary class label. The goal is to build a decision tree that classifies each banknote as genuine or forged.


1. Decision Tree Helpers

This assignment is a bit different from the previous one. Instead of using the scikit-learn library, we'll be implementing a decision tree from scratch. The goal is to understand the inner workings of a decision tree and how it can be used for classification.

1.1 Gini Index Calculation

Calculate the Gini index for a candidate split, i.e. for a list of groups of rows, weighting each group's impurity by its relative size:

def gini_index(groups: list, classes: list) -> float:
    total = sum(len(group) for group in groups)

    gini = 0.0
    for group in groups:
        size = len(group)
        if not size:
            continue
        # Class labels live in the last column of each row.
        labels = [row[-1] for row in group]
        score = sum((labels.count(class_val) / size) ** 2 for class_val in classes)
        gini += (1.0 - score) * (size / total)
    return gini
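
As a quick sanity check (a hypothetical toy split with rows of the form [feature, class_label], not part of the assignment data), a pure group contributes nothing to the index, while a 50/50 group contributes its weighted impurity:

left = [[1.0, 0], [1.2, 0]]                 # pure group: only class 0
right = [[3.1, 1], [2.9, 0]]                # mixed group: one of each class
print(gini_index([left, right], [0, 1]))    # 0.25 = 0.5 * (1 - 0.5**2 - 0.5**2)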

1.2 Dataset Split Based on Attribute Value

Split a dataset into two lists of rows based on the index of an attribute and a threshold value for that attribute:

def test_split(index: int, value: float, dataset: list) -> tuple:
    left = [row for row in dataset if row[index] < value]
    right = [row for row in dataset if row[index] >= value]
    return left, right
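
A small usage sketch with made-up rows of the form [feature_0, feature_1, class_label]:

rows = [[2.7, 1.1, 0], [1.3, 0.9, 0], [3.6, 2.0, 1]]
left, right = test_split(0, 2.7, rows)
print(left)    # [[1.3, 0.9, 0]]                 (feature 0 <  2.7)
print(right)   # [[2.7, 1.1, 0], [3.6, 2.0, 1]]  (feature 0 >= 2.7)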

1.3 Optimal Split Point Selection Using Gini Index

Determine the best attribute and value to split the dataset on:

def get_split(dataset: list) -> dict:
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = float('inf'), float('inf'), float('inf'), None

    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups

    return {
        'index': b_index, 
        'value': b_value, 
        'groups': b_groups
    }
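
On a hypothetical one-feature dataset where the classes are separable, the exhaustive search over every attribute value should recover the separating threshold:

toy = [[1.0, 0], [1.5, 0], [3.0, 1], [3.5, 1]]
node = get_split(toy)
print(node['index'], node['value'])   # 0 3.0 -> splitting at feature 0 >= 3.0 gives a Gini index of 0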

1.4 Optimal Split Point Selection Using Entropy

This is an alternative to the previous method that uses entropy instead of the Gini index:

import math
from typing import Any, Dict, List, Union

def get_split_entropy(dataset: List[List[Union[int, float]]]) -> Dict[str, Any]:
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = float('inf'), float('inf'), float('inf'), None

    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            entropy = calculate_entropy(groups, class_values)
            if entropy < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], entropy, groups

    return {
        'index': b_index, 
        'value': b_value, 
        'groups': b_groups
    }

def calculate_entropy(groups: List[List[Union[int, float]]], classes: List[Union[int, float]]) -> float:
    n_instances = float(sum([len(group) for group in groups]))
    entropy = 0.0

    for group in groups:
        size = float(len(group))
        if size == 0:
            continue
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            if p > 0:
                score -= p * math.log2(p)
        entropy += (score * size / n_instances)

    return entropy
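
As with the Gini index, a perfectly separating split has an entropy of zero, while a split whose groups are both 50/50 mixtures has an entropy of one bit (a toy illustration with rows of the form [feature, class_label]):

pure  = [[[1.0, 0], [1.2, 0]], [[3.0, 1], [3.2, 1]]]
mixed = [[[1.0, 0], [3.0, 1]], [[1.2, 0], [3.2, 1]]]
print(calculate_entropy(pure,  [0, 1]))   # 0.0
print(calculate_entropy(mixed, [0, 1]))   # 1.0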

2. Building the Decision Tree

2.1 Terminal Node Creation

Determine the class label for a terminal (leaf) node by majority vote over a group of rows:

def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)
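
The leaf label is simply the most common class among the rows that reach it, for example:

group = [[1.1, 0], [1.4, 0], [2.0, 1]]   # hypothetical rows [feature, class_label]
print(to_terminal(group))                # 0 (two of the three rows are class 0)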

2.2 Recursive Data Splitting

Recursively split the data to construct the decision tree:

def split(node: Dict[str, Any], max_depth: int, min_size: int, depth: int) -> None:
    left, right = node['groups']
    del node['groups']

    # Check if there's no split.
    if not left or not right:
        terminal_value = to_terminal(left + right)
        node['left'] = node['right'] = terminal_value
        return

    # Check for max depth.
    if depth >= max_depth:
        node['left'] = to_terminal(left)
        node['right'] = to_terminal(right)
        return

    # Process left child.
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth + 1)

    # Process right child.
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth + 1)

2.3 Decision Tree Construction

Start building the tree from the root:

def build_tree(train: list, max_depth: int, min_size: int) -> dict:
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root
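
To inspect the learned rules, a small helper (not required by the assignment, just a sketch) can print the tree with one level of indentation per depth:

def print_tree(node, depth=0):
    # Internal nodes are dicts with 'index'/'value' keys; leaves are plain class labels.
    if isinstance(node, dict):
        print(f"{'  ' * depth}[X{node['index']} < {node['value']:.3f}]")
        print_tree(node['left'], depth + 1)
        print_tree(node['right'], depth + 1)
    else:
        print(f"{'  ' * depth}[{node}]")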

2.4 Prediction with the Decision Tree

Given a constructed decision tree and a data row, predict its class label:

def predict(node: dict, row: List[Any]) -> Any:
    if row[node['index']] < node['value']:
        next_node = node['left']
    else:
        next_node = node['right']

    if isinstance(next_node, dict):
        return predict(next_node, row)
    return next_node
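
The same dictionary structure can be built by hand, which makes the traversal easy to verify (a hypothetical single-split stump):

stump = {'index': 0, 'value': 2.5, 'left': 0, 'right': 1}
print(predict(stump, [1.7, 0.3]))   # 0 (feature 0 is below the threshold)
print(predict(stump, [3.9, 0.3]))   # 1 (feature 0 is at or above the threshold)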

3. Testing the Decision Tree Implementation on Real Data

3.1 Load and Preprocess Data

from csv import reader

# Load a CSV file
def load_csv(filename):
    with open(filename, "r") as file:
        dataset = [row for row in reader(file) if row]  # skip blank lines
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    for row in dataset:
        row[column] = int(row[column].strip())

filename = 'data_banknote_authentication.csv'
dataset = load_csv(filename)

for i in range(4):
    str_column_to_float(dataset, i)
str_column_to_int(dataset, 4)
print(dataset[:5])  # preview the first few rows

# Use the first two thirds of the rows for training and the rest for testing.
split_point = int(len(dataset) * 2 / 3)
train = dataset[:split_point]
test = dataset[split_point:]
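
If the CSV file happens to list the rows grouped by class, a sequential two-thirds/one-third split gives train and test sets with very different class balances. An optional sketch that shuffles the rows before splitting (using Python's random module; not used for the result reported below):

import random

random.seed(1)          # reproducible shuffle
random.shuffle(dataset)
split_point = int(len(dataset) * 2 / 3)
train = dataset[:split_point]
test = dataset[split_point:]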

3.2 Build the Decision Tree and Evaluate Its Accuracy

Train the tree on the training rows, then compare its predictions with the true labels of the test rows:

from sklearn.metrics import accuracy_score

tree = build_tree(train, max_depth=5, min_size=10)
predictions = [predict(tree, row) for row in test]

actual_labels = [row[-1] for row in test]
accuracy = accuracy_score(actual_labels, predictions)

print(f'Accuracy: {accuracy:.2%}')

With max_depth=5 and min_size=10, the decision tree reaches an accuracy of 92.1% on the test set.