Lecture 3. Tree Ensemble Methods

Date: 2023-05-04

1. Overview

1.1 Introduction to Ensemble Methods

Ensemble methods combine the output of multiple models (often decision trees) to deliver a superior result. The collective intelligence of multiple models can capture more nuances and reduce both bias and variance in predictions, leading to more robust and accurate models.

1.2 Bagging

Bagging, or Bootstrap Aggregating, is all about creating multiple versions of a dataset by random sampling with replacement (i.e., bootstrapping). For each of these datasets, a decision tree is trained. The final output is an average (for regression problems) or a majority vote (for classification tasks) of the predictions from all the trees. By doing this, bagging helps in reducing the variance of the model without increasing the bias, leading to a more stable prediction.

1.3 Random Forests

Random Forests take the concept of bagging a step further. While they also use bootstrapped datasets to train multiple trees, there's an additional twist. When determining the best split at each node, only a random subset of the features is considered. This randomness ensures the trees are less correlated, which reduces variance even more than simple bagging. The results from all trees are then averaged or voted upon, similar to bagging.

1.4 Boosting

While bagging and random forests create trees in parallel and combine their results, boosting builds trees sequentially. Each tree tries to correct the errors made by the previous one. Instead of relying on simple averaging or voting, boosting gives more weight to the data points that were misclassified by earlier trees. Over time, this method gives rise to a strong model even if the base learners are weak (like shallow trees). Boosting often leads to better predictive performance by focusing on reducing bias.

1.5 Bayesian Additive Regression Trees

Bayesian Additive Regression Trees, or BART, is an ensemble method that merges the ideas of boosting with Bayesian probability. Trees are built iteratively, similar to boosting, but the approach is probabilistic. Using Markov Chain Monte Carlo (MCMC) techniques, BART defines a posterior distribution over potential decision trees. This not only results in an ensemble of trees but also provides a measure of uncertainty about predictions, which is a significant advantage in many applications.

Summary

| Method | Key Characteristic | Focus | Parallel/Sequential | Provides Uncertainty Estimates? |
| --- | --- | --- | --- | --- |
| Bagging | Multiple bootstrapped datasets | Reduce variance | Parallel | No |
| Random Forests | Random subset of features for splits | Reduce variance | Parallel | No |
| Boosting | Sequential trees correcting errors | Reduce bias | Sequential | No |
| BART | Bayesian approach with MCMC | Boosting with uncertainty estimates | Sequential | Yes |

2. Bagging

2.1 Introduction

Bagging, which stands for Bootstrap Aggregating, is a powerful ensemble method designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It works by creating multiple sets of data by bootstrapping (random sampling with replacement) and then training a separate decision tree on each dataset. Finally, it aggregates the results to produce a single output.

2.2 Bagging for Regression

Let's break down the process of bagging for regression:

  1. Bootstrap Sampling:

    • From a dataset of size $n$, create $B$ new datasets by random sampling with replacement. Each of these datasets will also be of size $n$.
    • Notably, some data points may be repeated, and some might not be included at all.
  2. Train Separate Models:

    • For each bootstrapped dataset, train a regression tree (or any other regressor).
  3. Prediction:

    • For a new input $x$, make a prediction using every model.
    • Mathematically, if $\hat{f}^b(x)$ is the prediction of the $b$-th tree for input $x$, the final prediction is the average of all predictions (a minimal code sketch follows below):

      $$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^b(x)$$
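
A minimal from-scratch sketch of this procedure, assuming NumPy arrays and scikit-learn regression trees; the function name and default settings are illustrative, not canonical.

```python
# Bagging for regression: bootstrap the data, fit one tree per sample,
# and average the predictions over all trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_regression(X, y, X_new, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(y)
    predictions = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n row indices with replacement.
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor()
        tree.fit(X[idx], y[idx])
        predictions.append(tree.predict(X_new))
    # Final prediction: the average over the B trees.
    return np.mean(predictions, axis=0)
```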

2.3 Bagging for Classification

Bagging for classification follows a similar methodology, but the final prediction is based on a majority vote:

  1. Bootstrap Sampling:

    • As before, create $B$ new datasets from the original dataset by random sampling with replacement.
  2. Train Separate Models:

    • For each bootstrapped dataset, train a classification tree (or any other classifier).
  3. Prediction:

    • For a new input $x$, make a prediction using every model.
    • If we consider a binary classification (classes A and B), and out of the $B$ models, $B_A$ models predict class A and $B_B$ models predict class B, then:
      • If $B_A > B_B$, predict class A
      • Otherwise, predict class B
    • For multi-class classification, the class with the highest count among the predictions is chosen (see the sketch below).
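
A minimal sketch of bagged classification using scikit-learn's BaggingClassifier, which bootstraps the rows and takes a majority vote over decision trees by default; the dataset and settings are illustrative.

```python
# Majority-vote bagging over 100 bootstrapped decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The default base estimator is a decision tree; each tree sees its own
# bootstrap sample, and predictions are combined by voting.
bag = BaggingClassifier(n_estimators=100, random_state=0)
bag.fit(X_tr, y_tr)
print("test accuracy:", bag.score(X_te, y_te))
```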

The strength of bagging arises from its ability to reduce variance by aggregating multiple models. By bootstrapping the data and training separate models, bagging effectively captures the diversity in the dataset and minimizes the chances of overfitting.


3. Random Forests

3.1 How is Random Forest Different from Bagging?

Random Forest is a natural extension of the bagging technique but introduces an extra layer of randomness that sets it apart. Here's how it differentiates:

  1. Feature Selection:

    • In bagging, when deciding the split for a node in the decision tree, all features are considered to find the best split.
    • In contrast, Random Forest considers only a random subset of the features for this decision. This randomness ensures that each tree in the forest is different, thereby making the ensemble more diverse.
  2. Decorrelation of Trees:

    • The random subset of features ensures that the individual trees are less correlated with each other. Less correlation means that the trees' errors are less likely to overlap, making the ensemble's output more robust.
  3. Hyperparameters:

    • Random Forest introduces hyperparameters like the number of features considered at each split. These can be fine-tuned to optimize the model further.

In essence, while both bagging and Random Forest use bootstrapped datasets and aggregation of results, the latter's added randomness in feature selection leads to a more diversified and often better-performing ensemble.
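
As a brief illustration of that extra hyperparameter, here is a hedged scikit-learn sketch in which max_features controls the size of the random feature subset considered at each split; the dataset and settings are illustrative.

```python
# Random Forest: bagging plus a random feature subset at every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # random subset of features per split (the RF twist)
    random_state=0,
)
rf.fit(X, y)
```

Setting max_features to the total number of features essentially recovers plain bagged trees, which makes the role of this hyperparameter easy to see empirically.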


4. Boosting

4.1 Introduction to Boosting

Boosting is an ensemble method that's focused on improving the performance of models by turning weak learners into strong ones. The main philosophy behind boosting is iterative: instead of training models independently and aggregating their results, models are trained sequentially with each new model focusing on the errors made by the previous ones.

4.2 Core Concept

  1. Weighted Data Points:

    • Initially, all data points are given equal weights. As the boosting process progresses, weights of misclassified data points increase, ensuring that subsequent models focus more on them.
  2. Sequential Training:

    • A model is trained and its errors are identified. The next model in the sequence specifically targets these errors by giving them more importance.
  3. Model Weighting:

    • Each model is assigned a weight based on its accuracy. More accurate models have greater influence in the final decision.

4.3 Algorithms under the Boosting Umbrella

The main boosting algorithms are outlined below, with classification and regression variants where applicable:

  1. Adaptive Boosting (AdaBoost):

    Algorithm (Classification)

    • Initialize each data point's weight to $w_i = 1/n$, where $n$ is the number of data points.
    • For $m = 1$ to $M$ (number of boosting rounds):
      • Fit a classifier $G_m(x)$ to the data using weights $w_i$.
      • Calculate the weighted error $\text{err}_m = \frac{\sum_{i=1}^{n} w_i \, \mathbb{1}(y_i \neq G_m(x_i))}{\sum_{i=1}^{n} w_i}$, where $\mathbb{1}$ is the indicator function.
      • Calculate the classifier's weight $\alpha_m = \log\left(\frac{1 - \text{err}_m}{\text{err}_m}\right)$.
      • Update data point weights: $w_i \leftarrow w_i \exp\left(\alpha_m \, \mathbb{1}(y_i \neq G_m(x_i))\right)$ for all $i$.
      • Normalize weights so they sum up to 1.
    • Final model: $G(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$ (a minimal code sketch follows below).
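
As a concrete illustration of the classification loop above, here is a minimal from-scratch sketch using depth-1 scikit-learn trees (stumps) as weak learners; it assumes labels coded as -1/+1 and is not an optimized implementation. In practice, scikit-learn's AdaBoostClassifier and AdaBoostRegressor (the latter implementing AdaBoost.R2) are the standard choices.

```python
# Minimal AdaBoost for binary classification with labels in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # initial weights w_i = 1/n
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        miss = (pred != y).astype(float)
        err = np.sum(w * miss) / np.sum(w)        # weighted error err_m
        alpha = np.log((1 - err) / (err + 1e-10)) # classifier weight alpha_m
        w = w * np.exp(alpha * miss)              # upweight misclassified points
        w /= w.sum()                              # normalize to sum to 1
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # Final model: sign of the alpha-weighted vote over all stumps.
    agg = sum(a * m.predict(X) for m, a in zip(learners, alphas))
    return np.sign(agg)
```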

    Algorithm (Regression: AdaBoost.R2)

    • Start with equal weights $w_i = 1/n$ for all training instances.
    • For $m = 1$ to $M$ (number of boosting rounds):
      • Fit a model $f_m(x)$ to the training data using the weights.
      • Calculate the predicted values $\hat{y}_i = f_m(x_i)$ for each instance.
      • For each instance, compute the absolute error $e_i = |y_i - \hat{y}_i|$ and rescale it to $L_i = e_i / \max_j e_j$, so that $L_i \in [0, 1]$.
      • For each instance, compute the weighted error $w_i L_i$.
      • Compute the average weighted error $\bar{L}_m = \sum_{i=1}^{n} w_i L_i$.
      • Calculate the model's weight via $\beta_m = \bar{L}_m / (1 - \bar{L}_m)$; the model's influence in the final prediction is $\alpha_m = \log(1/\beta_m)$.
      • Update instance weights: $w_i \leftarrow w_i \, \beta_m^{\,1 - L_i}$, so weights shrink for well-predicted instances and stay relatively large for poorly predicted ones.
      • Normalize instance weights to sum up to 1.
    • Final model: the weighted median (or weighted average) of the predictions $f_m(x)$, using the weights $\alpha_m$.
  2. Gradient Boosting:

    Algorithm (Regression)

    • Initialize the model with a constant value: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$, where $L$ is the loss function.
    • For $m = 1$ to $M$:
      • Compute the pseudo-residuals: $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$ for all $i$.
      • Fit a decision tree $h_m(x)$ to the pseudo-residuals.
      • Choose a step size using line search: $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma \, h_m(x_i)\big)$.
      • Update the model: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$ (a minimal code sketch follows below).
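
As a concrete illustration of the regression algorithm above, here is a minimal from-scratch sketch for squared-error loss, where the pseudo-residuals reduce to $y_i - F_{m-1}(x_i)$ and a fixed learning rate stands in for the per-round line search; names and settings are illustrative.

```python
# Minimal gradient boosting for regression with squared-error loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, lr=0.1, max_depth=3):
    F0 = np.mean(y)                  # constant initial model for squared loss
    F = np.full(len(y), F0)
    trees = []
    for _ in range(M):
        residuals = y - F            # pseudo-residuals for squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        F += lr * tree.predict(X)    # shrunken update: F_m = F_{m-1} + lr * h_m
        trees.append(tree)
    return F0, trees

def gradient_boost_predict(X, F0, trees, lr=0.1):
    # Sum of the initial constant and all shrunken tree contributions.
    return F0 + lr * sum(t.predict(X) for t in trees)
```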

    Algorithm (Classification)

    • Initialize with a constant predictor: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$ (for logistic loss, this is the log-odds of the positive class).
    • For $m = 1$ to $M$:
      • Calculate the negative gradient (pseudo-residuals) for the current model: $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$ for all $i$.
      • Fit a decision tree $h_m(x)$ to these pseudo-residuals.
      • Choose a step size using line search: $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma \, h_m(x_i)\big)$.
      • Update the model: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$.
    • For binary classification, the output can be converted to class probabilities using the logistic function, and the class with the highest probability is the prediction. For multi-class classification, a softmax function can be used.
  3. XGBoost, LightGBM, CatBoost:

    These algorithms are advanced implementations of gradient boosting with optimizations for speed, accuracy, and other features. Let's focus on XGBoost as an example:

    XGBoost Algorithm

    • The objective function combines a loss term and a regularization term: $\text{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{j=1}^{K} \Omega(f_j)$, where $\Omega(f_j)$ is the regularization term and $f_j$ is the $j$-th tree.
    • Regularization helps control the complexity of individual trees. The regularization term in XGBoost is defined as $\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$, where $T$ is the number of leaves in the tree and $w$ is the vector of scores on the leaves.
    • XGBoost utilizes second-order approximations of the loss to find the best split points and leaf scores, making it more accurate than traditional gradient boosting (a brief usage sketch follows below).
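
A brief usage sketch with the xgboost Python package (assuming it is installed); the dataset and parameter values are illustrative, not recommendations. Here reg_lambda corresponds to the $\lambda$ penalty on leaf scores and gamma to the $\gamma$ penalty on the number of leaves (the minimum loss reduction required to split).

```python
# Illustrative XGBoost regressor with explicit regularization settings.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(
    n_estimators=200,      # number of boosting rounds
    max_depth=3,           # depth of each tree
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    reg_lambda=1.0,        # L2 penalty on leaf scores (the lambda * ||w||^2 term)
    gamma=0.0,             # minimum loss reduction to split (the gamma * T term)
)
model.fit(X, y)
print(model.predict(X[:5]))
```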

Note: AdaBoost was primarily presented in the context of classification. The initial version of AdaBoost (often referred to as AdaBoost.M1) was specifically designed for binary classification problems. On the other hand, Gradient Boosting was initially described in the context of regression, where it tackled the problem by fitting new trees to the residuals (the difference between the predicted and actual values) of the previous tree.

4.4 Strengths and Weaknesses

Strengths:

  • Performance: Often provides superior predictive accuracy.
  • Versatility: Can be used for both classification and regression tasks.
  • Handles Imbalance: Can deal with imbalanced datasets by adjusting weights.

Weaknesses:

  • Overfitting: Without proper tuning, boosting can overfit on noisy data.
  • Computational Cost: Training models sequentially can be more time-consuming than parallel methods like bagging.

4.5 Conclusion

Boosting is a potent method in the ensemble toolkit. By building on the errors of previous models, it can achieve impressive accuracy even when the base learners are relatively simple. However, its strength hinges on proper parameter tuning and monitoring to avoid overfitting.


5. Bayesian Additive Regression Trees (BART)

5.1 Introduction

Bayesian Additive Regression Trees, commonly known as BART, is a nonparametric Bayesian regression approach that uses decision trees as its base learners. The "additive" aspect comes from the fact that predictions are constructed as a sum of decision tree outputs, similar to boosting. However, BART's distinction lies in its Bayesian nature: rather than iteratively refining predictions based on residuals (like boosting), BART places prior distributions over model parameters and uses Markov Chain Monte Carlo (MCMC) sampling to make posterior inferences.

5.2 Key Concepts of BART

  1. Tree-based modeling: Like other ensemble methods, BART uses decision trees, specifically shallow decision trees (typically with a depth of 1 to 3). These trees partition the data into homogeneous groups but don't necessarily provide a highly accurate prediction on their own.

  2. Additive modeling: Predictions are made by summing up the results from multiple trees. This is reminiscent of boosting, but in BART the trees are fit jointly under Bayesian priors rather than added greedily one after another.

  3. Bayesian Priors: Priors are placed on tree structures, the parameters within the trees, and the number of trees. This regularization ensures that no single tree dominates the prediction and promotes diversity among the ensemble of trees.

5.3 Mathematical Overview

Given data $(x_i, y_i)$, where $y_i$ is the response and $x_i$ is a vector of predictors for $i = 1, \dots, n$, the BART model can be expressed as:

$$y_i = \sum_{j=1}^{m} g(x_i; T_j, M_j) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$$

Where:

  • $m$ is the number of trees.
  • $g(x; T_j, M_j)$ is the function of the $j$-th tree, with tree structure $T_j$ and corresponding leaf parameters $M_j$.
  • $\epsilon_i$ is the noise.

The main goal is to infer the posterior distribution of the tree structures $T_j$, the parameters $M_j$, and the noise variance $\sigma^2$ given the data.

To achieve this, BART employs MCMC techniques, notably:

  1. Metropolis-Hastings to propose changes to the tree structures.
  2. Gibbs sampling to update the tree parameters and noise variance.

By performing these sampling steps iteratively, BART can generate a diverse set of trees that collaboratively produce robust predictive performance.

5.4 Advantages of BART

  1. Regularization: The Bayesian priors act as a strong regularizer, preventing overfitting, which is a common problem in tree-based models.
  2. Flexibility: BART can capture non-linear and interaction effects without explicitly specifying them.
  3. Uncertainty Quantification: Being Bayesian in nature, BART not only provides point estimates but also uncertainty intervals around its predictions.

6. Q&A

1. What are ensemble methods in machine learning and why are they used?

Answer: Ensemble methods combine multiple models to improve overall performance. They are used to enhance prediction accuracy, reduce overfitting, and ensure robustness against various data distributions.


2. How does bagging reduce the variance of a model?

Answer: Bagging (Bootstrap Aggregating) involves creating multiple datasets through bootstrapping (sampling with replacement) and training a model on each dataset. The predictions from these models are averaged (for regression) or voted upon (for classification). This process reduces the variance by averaging out individual model inconsistencies.


3. What makes Random Forest different from plain bagging with decision trees?

Answer: Random Forest introduces an extra layer of randomness. In addition to bootstrapping datasets, during each split in the tree-building process, a random subset of features is selected. This ensures that individual trees are de-correlated, further enhancing model diversity and reducing overfitting.


4. In boosting, why are models built sequentially rather than in parallel like in bagging?

Answer: Boosting focuses on correcting the errors of previous models. Each new model in the sequence specifically targets the misclassified or high-residual data points from the preceding model. This sequential process allows boosting algorithms to improve upon areas where prior models struggled.


5. What's the difference between AdaBoost and Gradient Boosting?

Answer: While both are boosting techniques, AdaBoost primarily adjusts instance weights to focus on misclassifications, and the final prediction is a weighted vote. Gradient Boosting, on the other hand, fits new models to the residuals of the previous model and updates predictions using a gradient descent algorithm.


6. How does BART differ from traditional boosting methods?

Answer: BART is Bayesian and non-parametric. Instead of refining predictions based on residuals, BART places prior distributions over model parameters and uses MCMC sampling for posterior inferences. The predictions come from a sum of decision tree outputs, but these trees are built with Bayesian considerations.


7. In the context of Random Forest, what is feature importance and how is it determined?

Answer: Feature importance quantifies how useful or valuable each feature was in the construction of the random forest. It’s determined by observing how much the tree nodes that use a particular feature reduce impurity on average (e.g., using Gini impurity or entropy). Features that frequently split nodes and create purer nodes have higher importance.
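
As a brief illustration (assuming scikit-learn; the data and settings are arbitrary), impurity-based importances are exposed on a fitted forest through the feature_importances_ attribute:

```python
# Read impurity-based feature importances from a fitted Random Forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")
```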


8. What's the main risk of using too many boosting iterations in a boosting algorithm?

Answer: Too many boosting iterations can lead to overfitting. As the algorithm continues to correct errors and reduce residuals, it may start to capture noise or outliers in the data, leading to decreased generalization to new, unseen data.


9. How do ensemble methods help in reducing model bias?

Answer: Ensemble methods, particularly boosting, focus on areas where the base learners are underperforming. By giving more emphasis to these harder-to-predict instances, the overall model can correct its biases and improve its accuracy.


10. Are ensemble methods always better than single models?

Answer: Not always. While ensemble methods can provide significant improvements in accuracy and robustness, they come with increased complexity and computational costs. For some datasets or tasks, a well-tuned single model might perform just as well, if not better.


11. Why might one choose to use a simple model over an ensemble method?

Answer: Simplicity can be beneficial in terms of interpretability, computational speed, and ease of deployment. If a simple model achieves satisfactory performance, it might be preferred over an ensemble due to its transparency and reduced computational requirements.


12. What role does diversity among base learners play in ensemble methods?

Answer: Diversity ensures that individual learners capture different aspects or patterns in the data. When predictions from diverse learners are combined, it reduces the risk of repeating the same mistakes across the ensemble, leading to improved generalization and reduced error.


13. How does bagging handle overfitting in decision trees?

Answer: Bagging reduces overfitting by averaging predictions from multiple decision trees trained on different subsets of data. This averaging process diminishes the noise and reduces the variance, leading to a more robust and generalized model.


14. In boosting, how are data points that are hard to predict treated in subsequent iterations?

Answer: Boosting gives more weight or emphasis to data points that were misclassified or had higher residuals in previous iterations. This ensures that subsequent models in the sequence focus more on these challenging data points, trying to correct the errors made.


15. For ensemble methods that involve random processes, like Random Forest, how do we ensure consistency in results?

Answer: To ensure consistency, one can set a random seed before training. A random seed ensures that the "random" processes are reproducible and yield consistent results each time the model is run with that specific seed.


16. What is "stacking" in the context of ensemble methods?

Answer: Stacking involves training multiple different models and using their predictions as inputs to another "meta-model" or "stacker." The meta-model then makes the final prediction based on these inputs. Stacking effectively combines the strengths of various models to improve overall performance.
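
A minimal sketch with scikit-learn's StackingClassifier, where the choice of base learners and meta-model is purely illustrative:

```python
# Stacking: two base learners feed a logistic-regression meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),   # the "stacker" / meta-model
)
stack.fit(X, y)
```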


17. How does the depth of trees in BART compare to those in boosting or Random Forests?

Answer: In BART, trees are typically shallow, often with a depth of 1 to 3. This contrasts with boosting, where trees might be deeper to capture intricate patterns in residuals, or Random Forests, where the depth can vary depending on the parameter settings.


18. Are ensemble methods computationally more expensive than individual models? Why or why not?

Answer: Yes, ensemble methods are generally more computationally expensive. This is because they involve training multiple base learners, which individually require their own computational resources. The process of combining predictions from all these learners also adds to the computational cost.


19. In Random Forests, is it necessary to always use decision trees as base learners?

Answer: While Random Forests traditionally use decision trees as base learners, the underlying idea of bootstrapping samples and aggregating predictions can theoretically be applied to other models. However, decision trees are most commonly used because of their ability to capture complex relationships without requiring extensive parameter tuning.


20. How do ensemble methods handle missing data?

Answer: Handling missing data depends on both the base learner and the ensemble technique. For instance, decision trees can naturally handle missing values by treating them as a separate split or by using surrogate splits. When aggregating predictions, ensemble methods like bagging and Random Forests can simply average or vote based on available data. Boosting might require imputation or specific techniques to handle missing values in the sequence of models.


21. How does the number of base learners in an ensemble method affect its performance?

Answer: Increasing the number of base learners can enhance the ensemble's performance up to a point, as averaging or voting over a larger set can lead to more robust and stable predictions. However, beyond a certain threshold, performance gains may diminish, and computational costs will continue to rise.


22. What is out-of-bag (OOB) error in the context of Random Forests?

Answer: OOB error is a method to estimate the generalization error of a Random Forest. Since each tree is trained on a bootstrap sample, around one-third of the data is left out and not used for training that particular tree. This "left-out" data, or "out-of-bag" data, can be used to validate the tree. The OOB error is the average error rate of each tree on its OOB samples.
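
A minimal scikit-learn sketch (settings illustrative): enabling oob_score=True makes the forest report accuracy computed on each tree's out-of-bag samples.

```python
# Out-of-bag (OOB) estimate: each tree is scored on the rows it never saw.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)   # OOB error = 1 - rf.oob_score_
```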


23. Why is boosting sensitive to noisy data and outliers?

Answer: Boosting focuses on correcting the errors of previous models. If there's noise or an outlier, boosting will give significant attention to these data points in subsequent iterations, potentially leading to overfitting or skewed predictions.


24. What is "feature bagging," and how does it differ from traditional bagging?

Answer: Feature bagging involves randomly selecting a subset of features for training each base learner, rather than using all features. This introduces an additional layer of diversity among the learners. While traditional bagging resamples data points, feature bagging resamples features.
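
A minimal sketch using scikit-learn's BaggingClassifier, whose max_features and bootstrap_features options add feature bagging on top of ordinary row bagging; the values are illustrative.

```python
# Feature bagging: each base learner sees a random subset of the features.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

feat_bag = BaggingClassifier(
    n_estimators=50,
    max_features=0.5,          # each learner sees a random 50% of the features
    bootstrap_features=True,   # sample features with replacement
    bootstrap=True,            # still bootstrap the rows as in ordinary bagging
    random_state=0,
)
feat_bag.fit(X, y)
```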


25. How do ensemble methods work in the context of unsupervised learning, like clustering?

Answer: Ensemble methods can be adapted for unsupervised tasks. For clustering, multiple clustering algorithms or configurations can be run. The results can then be combined using methods like majority voting, where each data point's cluster assignment is decided by the majority assignment across all base learners.


26. How do hyperparameters of base learners affect the overall performance of ensemble methods?

Answer: The performance of ensemble methods is directly influenced by the hyperparameters of their base learners. For instance, the depth of trees in a Random Forest or boosting method can determine the bias-variance trade-off. Properly tuning these hyperparameters is crucial for optimizing ensemble performance.


27. Can ensemble methods be combined? For example, can boosting be applied to a bagged model?

Answer: Yes, ensemble methods can be nested or combined. For instance, one could boost a model where the base learner itself is an ensemble like a bagged model. However, combining ensembles increases complexity and computational cost, so there should be a clear performance benefit.


28. How does the choice of loss function in boosting affect the algorithm?

Answer: The loss function defines the optimization problem that boosting is trying to solve. Different loss functions can make the algorithm focus on different aspects of the data. For instance, a least squares loss might be used for regression problems, while a logistic loss might be used for classification.


29. How do ensemble methods handle class imbalance in classification problems?

Answer: Ensemble methods can be particularly effective for class imbalance. For instance, in boosting, misclassified instances are given more weight, which can help focus on the minority class. Additionally, techniques like balanced bagging, where each bootstrap sample has an equal number of instances from each class, can be used.


30. How do ensemble methods affect model interpretability, and how can this be mitigated?

Answer: Ensemble methods, especially those combining many complex models, can reduce model interpretability. To mitigate this, methods like feature importance, partial dependence plots, and SHAP (SHapley Additive exPlanations) values can be used to understand the contributions of different features to the ensemble's predictions.