Homework 2. Decision Trees: Banknote Authentication
Introduction
This assignment is based on the Banknote Authentication dataset from the UCI Machine Learning Repository. The dataset contains 1,372 samples of genuine and forged banknotes; each sample has four continuous features and a binary class label. The goal is to build a decision tree that classifies banknotes as either genuine or forged.
1. Decision Tree Helpers
This assignment is a bit different from the previous one. Instead of using the scikit-learn
library, we'll be implementing a decision tree from scratch. The goal is to understand the inner workings of a decision tree and how it can be used for classification.
1.1 Gini Index Calculation
Calculate the weighted Gini index for a candidate split, given the groups of rows it produces and the list of class values:
def gini_index(groups: list, classes: list) -> float:
    # Total number of rows across all groups of the split.
    total = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        size = len(group)
        if not size:
            continue
        # Sum of squared class proportions; the class label is the last column.
        score = sum(([row[-1] for row in group].count(class_val) / size) ** 2
                    for class_val in classes)
        # Weight each group's impurity by its relative size.
        gini += (1.0 - score) * (size / total)
    return gini
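As a quick sanity check (using made-up rows, not the banknote data), a perfectly separating split should score 0.0, while a split that leaves both groups evenly mixed between two classes should score 0.5:
# Sanity check on hypothetical rows of the form [feature, label].
pure = [[[1.0, 0], [1.2, 0]], [[3.0, 1], [3.1, 1]]]
mixed = [[[1.0, 0], [3.0, 1]], [[1.2, 0], [3.1, 1]]]
print(gini_index(pure, [0, 1]))   # 0.0
print(gini_index(mixed, [0, 1]))  # 0.5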
1.2 Dataset Split Based on Attribute Value
Split a dataset based on a specified attribute and its value:
def test_split(index: int, value: float, dataset: list) -> tuple:
left = [row for row in dataset if row[index] < value]
right = [row for row in dataset if row[index] >= value]
return left, right
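For example, splitting a small made-up dataset on the first attribute at a threshold of 2.0 separates the two classes cleanly:
# Hypothetical example: rows are [feature, label]; split on attribute 0 at 2.0.
toy = [[1.0, 0], [1.5, 0], [2.5, 1], [3.0, 1]]
left, right = test_split(0, 2.0, toy)
print(left)   # [[1.0, 0], [1.5, 0]]
print(right)  # [[2.5, 1], [3.0, 1]]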
1.3 Optimal Split Point Selection Using Gini Index
Determine the best attribute and value to split the dataset on:
def get_split(dataset: list) -> dict:
class_values = list(set(row[-1] for row in dataset))
b_index, b_value, b_score, b_groups = float('inf'), float('inf'), float('inf'), None
for index in range(len(dataset[0]) - 1):
for row in dataset:
groups = test_split(index, row[index], dataset)
gini = gini_index(groups, class_values)
if gini < b_score:
b_index, b_value, b_score, b_groups = index, row[index], gini, groups
return {
'index': b_index,
'value': b_value,
'groups': b_groups
}
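On the same made-up dataset, the exhaustive search should pick attribute 0 with a threshold that sends each class to its own group:
# Hypothetical example: the best split falls on attribute 0 at value 2.5,
# giving a split with Gini index 0.0.
toy = [[1.0, 0], [1.5, 0], [2.5, 1], [3.0, 1]]
best = get_split(toy)
print(best['index'], best['value'])  # 0 2.5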
1.4 Optimal Split Point Selection Using Entropy
This is an alternative to the previous method that uses entropy instead of the Gini index:
import math
from typing import Any, Dict, List, Union

def get_split_entropy(dataset: List[List[Union[int, float]]]) -> Dict[str, Any]:
class_values = list(set(row[-1] for row in dataset))
b_index, b_value, b_score, b_groups = float('inf'), float('inf'), float('inf'), None
for index in range(len(dataset[0]) - 1):
for row in dataset:
groups = test_split(index, row[index], dataset)
entropy = calculate_entropy(groups, class_values)
if entropy < b_score:
b_index, b_value, b_score, b_groups = index, row[index], entropy, groups
return {
'index': b_index,
'value': b_value,
'groups': b_groups
}
def calculate_entropy(groups: List[List[Union[int, float]]], classes: List[Union[int, float]]) -> float:
n_instances = float(sum([len(group) for group in groups]))
entropy = 0.0
for group in groups:
size = float(len(group))
if size == 0:
continue
score = 0.0
for class_val in classes:
p = [row[-1] for row in group].count(class_val) / size
if p > 0:
score -= p * math.log2(p)
entropy += (score * size / n_instances)
return entropy
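Mirroring the Gini sanity check on the same made-up splits: a pure split has entropy 0.0, and a 50/50 mixed two-class split has entropy 1.0 (one bit per row):
# Hypothetical check with rows of the form [feature, label].
pure = [[[1.0, 0], [1.2, 0]], [[3.0, 1], [3.1, 1]]]
mixed = [[[1.0, 0], [3.0, 1]], [[1.2, 0], [3.1, 1]]]
print(calculate_entropy(pure, [0, 1]))   # 0.0
print(calculate_entropy(mixed, [0, 1]))  # 1.0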
2. Building the Decision Tree
2.1 Terminal Node Creation
Determine the class label for a group of rows:
def to_terminal(group):
outcomes = [row[-1] for row in group]
return max(set(outcomes), key=outcomes.count)
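For instance, a group containing two rows of class 0 and one row of class 1 becomes a leaf that predicts class 0:
# Hypothetical example: the majority label of the group becomes the leaf value.
group = [[1.0, 0], [1.2, 0], [2.5, 1]]
print(to_terminal(group))  # 0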
2.2 Recursive Data Splitting
Recursively split the data to construct the decision tree:
def split(node: Dict[str, Any], max_depth: int, min_size: int, depth: int) -> None:
left, right = node['groups']
del node['groups']
# Check if there's no split.
if not left or not right:
terminal_value = to_terminal(left + right)
node['left'] = node['right'] = terminal_value
return
# Check for max depth.
if depth >= max_depth:
node['left'] = to_terminal(left)
node['right'] = to_terminal(right)
return
# Process left child.
if len(left) <= min_size:
node['left'] = to_terminal(left)
else:
node['left'] = get_split(left)
split(node['left'], max_depth, min_size, depth + 1)
# Process right child.
if len(right) <= min_size:
node['right'] = to_terminal(right)
else:
node['right'] = get_split(right)
split(node['right'], max_depth, min_size, depth + 1)
2.3 Decision Tree Construction
Start building the tree from the root:
def build_tree(train, max_depth, min_size):
root = get_split(train)
split(root, max_depth, min_size, 1)
return root
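The resulting tree is just a nested dictionary with 'index', 'value', 'left', and 'right' keys, and plain class labels at the leaves. A small helper like the sketch below (my own addition, not required by the assignment) can be handy for inspecting it:
# Optional sketch: print the nested-dict tree with indentation,
# one decision node or leaf per line.
def print_tree(node, depth=0):
    if isinstance(node, dict):
        print(f"{'  ' * depth}[X{node['index']} < {node['value']:.3f}]")
        print_tree(node['left'], depth + 1)
        print_tree(node['right'], depth + 1)
    else:
        print(f"{'  ' * depth}[{node}]")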
2.4 Prediction with the Decision Tree
Given a constructed decision tree and a data row, predict its class label:
def predict(node: dict, row: List[Any]) -> Any:
if row[node['index']] < node['value']:
next_node = node['left']
else:
next_node = node['right']
if isinstance(next_node, dict):
return predict(next_node, row)
return next_node
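Before moving to the banknote data, an end-to-end check on a tiny made-up dataset (my own example) confirms that building and querying the tree behave as expected:
# Hypothetical end-to-end check: fit a shallow tree and classify two new rows.
toy = [[1.0, 0], [1.5, 0], [2.5, 1], [3.0, 1]]
toy_tree = build_tree(toy, max_depth=2, min_size=1)
print(predict(toy_tree, [1.2]))  # expected 0
print(predict(toy_tree, [2.8]))  # expected 1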
3. Testing the Decision Tree Implementation on Real Data
3.1 Load and preprocess data
# Load a CSV file
from csv import reader

def load_csv(filename):
    with open(filename, 'r') as file:
        dataset = list(reader(file))
    return dataset

# Convert a string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert the string class column to integer
def str_column_to_int(dataset, column):
    for row in dataset:
        row[column] = int(row[column].strip())

filename = 'data_banknote_authentication.csv'
dataset = load_csv(filename)
for i in range(4):
    str_column_to_float(dataset, i)
str_column_to_int(dataset, 4)
print(dataset[:3])  # preview the first few rows

# Use the first two thirds of the rows for training, the rest for testing.
split_point = int(len(dataset) * 2 / 3)
train = dataset[:split_point]
test = dataset[split_point:]
3.2 Build the decision tree and evaluate its accuracy
Train the tree on the first two thirds of the data and measure its accuracy on the held-out third:
tree = build_tree(train, max_depth=5, min_size=10)
predictions = [predict(tree, row) for row in test]
actual_labels = [row[-1] for row in test]
accuracy = sum(p == a for p, a in zip(predictions, actual_labels)) / len(actual_labels)
print(f'Accuracy: {accuracy:.2%}')
The accuracy of the decision tree is 92.1%.
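As an optional follow-up (a sketch, not required by the assignment), counting the misclassifications per class shows whether the errors are concentrated in genuine or forged notes:
# Optional sketch: count (actual, predicted) pairs for the misclassified rows.
from collections import Counter
errors = Counter((a, p) for a, p in zip(actual_labels, predictions) if a != p)
print(errors)  # e.g. Counter({(1, 0): ..., (0, 1): ...})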