Lecture 2. Decision Trees

Date: 2023-04-27

1. Overview

Decision Trees are flowchart-like structures where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes hold a class label or a regression value. They can be used for both classification and regression tasks.

Regression Trees

  • Purpose: Used when the target variable is continuous and numeric. The objective is to partition the data in a way that reduces the variance of the target variable.

  • Leaf Value Calculation: The average of the target variable values within a leaf is used as the prediction for that leaf.

  • Split Criterion: Variance or mean squared error (MSE) reduction is commonly used as the criterion for making splits.

[Figure: example of a regression tree]

Classification Trees

  • Purpose: Used when the target variable is categorical. The goal is to partition the data in a way that improves the purity of the target classes in each subset.

  • Leaf Value Calculation: Typically, the mode (most common class) of the target variable values within a leaf is used as the prediction for that leaf.

  • Split Criterion: Measures like Gini impurity, information gain, or chi-square are commonly used to decide on the best split.

[Figure: example of a classification tree]

Trees Versus Linear Models

  • Complexity: Trees can capture complex non-linear relationships in the data, whereas linear models make assumptions about the linearity of the relationship.

  • Interpretability: Trees are more intuitive and easier to visualize, which can be advantageous for explaining decisions. In contrast, linear models provide coefficients for each feature, which can be interpreted in terms of the relationship's direction and strength.

  • Performance: While trees can fit complex datasets, they can also overfit easily. Regularized linear models, on the other hand, might not fit very complex datasets well but can be less prone to overfitting.

Advantages and Disadvantages of Trees

Advantages:

  1. Easy Interpretation: Trees are intuitive and can be visualized, making them great for understanding and explaining decisions.
  2. Minimal Data Pre-processing: They don't require feature scaling or normalization.
  3. Handle Mixed Data Types: Can deal with both categorical and numerical features.
  4. Non-parametric: Don't make strong assumptions about the underlying data distribution.

Disadvantages:

  1. Overfitting: Without constraints, trees can create overly complex models that don't generalize well.
  2. Sensitivity: Small changes in data can result in a significantly different tree.
  3. Optimality: The greedy nature of tree-building algorithms doesn't guarantee a global optimal tree.

Summary

| Aspect | Classification Trees | Regression Trees |
| --- | --- | --- |
| Variable Types | Categorical or continuous predictors; categorical target variable | Categorical or continuous predictors; continuous target variable |
| Use Cases | Medical diagnosis; customer segmentation; fraud detection; species classification | Financial forecasting; sales prediction; environmental modeling |
| Node Splitting Criteria | Gini impurity; information gain (entropy-based); chi-square test | Variance reduction; mean squared error (MSE); mean absolute error (MAE) |
| Extra Considerations | Majority voting for leaf nodes; handling class imbalance | Averaging the target values for leaf nodes; predicting continuous values |
| Pruning | Reduced error pruning or cost complexity pruning | Reduced error pruning or cost complexity pruning |
| Handling Missing Values | Skipping splits over missing values; surrogate splits; imputation methods | Similar methods as classification trees; consider mean imputation for the target variable |

2. Regression Trees

2.1 Introduction to Regression Trees

Definition and Use Cases:

A Regression Tree is a decision tree that is used for predicting a continuous target variable. Just like a decision tree for classification tasks, the regression tree splits the dataset into subsets. However, instead of using these splits to make a final classification, they're used to make a numeric prediction.

Use Cases:

  1. Real Estate Pricing: Predicting the selling price of houses based on features like area, location, number of bedrooms, etc.
  2. Stock Price Forecasting: Predicting future stock prices based on historical data and possibly other financial indicators.
  3. Sales Forecasting: Anticipating sales figures for products based on features like seasonality, promotions, or store locations.
  4. Predicting Growth: Such as the growth of plants based on various environmental factors.

Differences between Regression Trees and Other Regression Methods:

  1. Nature of Model:

    • Regression Trees: Non-linear model, capable of capturing complex patterns by making a series of decisions based on the features of the data.
    • Traditional Regression Methods: Linear models, which assume a linear relationship between predictors and the target variable.
  2. Interpretability:

    • Regression Trees: Highly interpretable and can be visualized as a flowchart-like structure, making it easy to understand and explain.
    • Traditional Regression Methods: Can be more mathematically complex, especially with multiple predictors, making them harder to visualize and explain without a background in statistics.
  3. Data Pre-processing:

    • Regression Trees: Require little to no pre-processing. They can handle mixed data types and don't need variables to be on a common scale.
    • Traditional Regression Methods: Often require normalization or standardization, especially when predictors are on different scales.
  4. Handling Outliers:

    • Regression Trees: Tend to be more resistant to outliers since they partition data into subsets based on conditions.
    • Traditional Regression Methods: Can be sensitive to outliers, which can unduly influence the regression line.
  5. Assumptions:

    • Regression Trees: Don't make strong assumptions about the underlying distribution of data or the relationships between variables.
    • Traditional Regression Methods: Rely on assumptions like homoscedasticity, linearity, and normality, which might not always hold.

2.2 Node Splitting Criteria for Regression Trees

Regression trees predict continuous outputs, unlike classification trees that predict class labels. When building a regression tree, the goal at each node is to find a split that makes the output values within each child node as homogeneous (or similar) as possible. To measure the homogeneity, or the quality of a split, various criteria can be used:

Variance Reduction:

One of the fundamental techniques for regression trees is to maximize the reduction in variance after a split.

Mathematically, the variance reduction for a split is defined as:

$$\Delta \mathrm{Var} = \mathrm{Var}(\text{parent}) - \left( \frac{N_L}{N}\,\mathrm{Var}(\text{left}) + \frac{N_R}{N}\,\mathrm{Var}(\text{right}) \right)$$

Where:

  • $N_L$ and $N_R$ are the number of samples in the left and right child nodes.
  • $N$ is the number of samples in the parent node.

The objective is to maximize this variance reduction.
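To make the criterion concrete, here is a minimal Python/NumPy sketch (the array values and the split below are illustrative assumptions, not from the lecture) of how the variance reduction for one candidate split could be computed:

```python
import numpy as np

def variance_reduction(y_parent, y_left, y_right):
    """Variance of the parent node minus the weighted variance of its children."""
    n = len(y_parent)
    weighted_child_var = (len(y_left) / n) * np.var(y_left) + (len(y_right) / n) * np.var(y_right)
    return np.var(y_parent) - weighted_child_var

# Toy target values in a parent node, split by a hypothetical threshold on some feature.
y = np.array([10.0, 12.0, 11.0, 9.0, 30.0, 32.0, 31.0, 29.0])
left = y[:4]   # samples sent to the left child
right = y[4:]  # samples sent to the right child
print(round(variance_reduction(y, left, right), 3))  # large reduction => good split
```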

Mean Squared Error (MSE):

The mean squared error (MSE) is another criterion used in regression trees. At each split, the tree tries to minimize the MSE of the child nodes.

For a given node, the MSE is calculated as:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2$$

Where:

  • $y_i$ is the actual value of the i-th sample.
  • $\bar{y}$ is the mean target value for the node.
  • $N$ is the number of samples in the node.

To evaluate a potential split, we compute the weighted average of the MSEs for the child nodes and choose the split that results in the lowest average MSE.

Mean Absolute Error (MAE):

The mean absolute error (MAE) is the average of the absolute differences between the predicted and actual values. It's less sensitive to outliers than MSE because it doesn't square the errors.

For a given node, MAE is calculated as:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \bar{y}|$$

Where $y_i$ is the actual value of the i-th sample and $\bar{y}$ is the node's predicted value.

The tree will evaluate potential splits by computing the weighted average of the MAEs for the child nodes and choose the split that minimizes the average MAE.
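The sketch below ties these criteria together: it scans candidate thresholds on a single feature and keeps the one with the lowest weighted child MSE (swapping in absolute deviations would give the MAE variant). The feature and target arrays are made-up examples.

```python
import numpy as np

def node_mse(y):
    """MSE of a node when it predicts its own mean."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def best_threshold(x, y):
    """Threshold on feature x that minimizes the weighted MSE of the two children."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:                       # candidate split points
        left, right = y[x <= t], y[x > t]
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_threshold(x, y))   # splits between 3 and 4, where the target jumps
```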

2.3 Leaf Node Value Determination

In regression trees, when you've decided to stop growing the tree further at a particular node, that node becomes a leaf. Unlike classification trees where the leaf nodes assign a class label, in regression trees, the leaf nodes assign a continuous value. Here's how that determination is made:

Averaging the Target Values:

The most common approach for determining the value of a leaf node in a regression tree is to simply average the target values of all samples that fall into that leaf.

For instance, if the samples in a leaf node have the target values of {10, 12, 11, 9, 11}, the value assigned to that leaf node would be their average, which is 10.6.

Mathematically, for a given leaf node containing samples with target values $y_1, y_2, \ldots, y_N$, the value assigned to the leaf is:

$$\hat{y}_{\text{leaf}} = \frac{1}{N} \sum_{i=1}^{N} y_i$$

This method ensures that the predictions are representative of the data in the leaf, minimizing the error when the samples within the leaf are close to the mean.

Predicting Continuous Values:

The purpose of regression trees is to predict a continuous target value. Once the tree has been constructed and a new data point is to be predicted, the data point traverses down the tree based on the features until it reaches a leaf. The value assigned to that leaf (calculated as above) becomes the prediction for the new data point.

It's crucial to understand that the predicted value for all data points landing in a particular leaf is the same. However, by increasing the depth of the tree or using techniques like boosting or random forests, the model can capture more complex relationships and provide more nuanced predictions.
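As a quick illustration of this piecewise-constant behavior, the sketch below uses scikit-learn's DecisionTreeRegressor (assuming scikit-learn is available; the noisy sine data is synthetic) to show that every point landing in the same leaf receives the same predicted value:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Predictions form a step function: points in the same leaf share one value (the leaf mean).
X_test = np.linspace(0, 10, 11).reshape(-1, 1)
print(np.round(tree.predict(X_test), 3))
print("distinct leaf values:", len(np.unique(tree.predict(X))), "for", len(X), "training samples")
```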

2.4 Handling Overfitting

Overfitting is a common challenge when training machine learning models, including decision trees. An overfit model performs well on the training data but poorly on new, unseen data. Here's how overfitting is addressed in regression trees:

Tree Pruning:

Tree pruning involves removing certain sub-trees or leaves from a fully grown tree. The idea is to go back on the decision of creating a split, removing unnecessary complexity and thus making the model more general.

  • Cost Complexity Pruning (also known as Weakest-Link Pruning): This technique introduces a penalty term for the number of terminal nodes (leaves) in the tree. The tree's complexity is weighed against its fit to the data. By adjusting a complexity parameter, the tree is pruned to minimize the sum of its prediction error (squared error in the regression case) plus the penalty term.
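In scikit-learn this idea is exposed through the ccp_alpha parameter of the tree estimators. The sketch below (synthetic data; an arbitrary mid-range alpha picked purely for illustration, whereas in practice it would be chosen by validation) computes the cost-complexity pruning path of a fully grown regression tree and compares it with a pruned version:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
path = full.cost_complexity_pruning_path(X_tr, y_tr)   # candidate alpha values

# Pick a mid-range alpha just for illustration; in practice choose it by validation.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)

print("leaves: full =", full.get_n_leaves(), " pruned =", pruned.get_n_leaves())
print("test R^2: full =", round(full.score(X_te, y_te), 3),
      " pruned =", round(pruned.score(X_te, y_te), 3))
```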

Maximum Depth Specification:

One way to prevent a tree from becoming too complex (and thus overfitting) is to define the maximum depth the tree can grow. This means setting a cap on how many questions (splits) we can ask before making a decision (reaching a leaf). By doing so, the model might not capture all nuances in the training data but will likely generalize better to new data.

Minimum Samples per Leaf:

Another approach is to set a minimum threshold on the number of samples required for a node to become a leaf. If a node has fewer samples than the specified threshold, it won't be allowed to split, even if it hasn't reached the maximum depth. This prevents the tree from making decisions based on small sets of data, which are often noisy and lead to overfitting.

Cross-Validation for Optimal Tree Depth:

Instead of arbitrarily choosing a maximum depth or a minimum sample threshold, you can use cross-validation to find the optimal tree size. The idea is to:

  1. Train trees of varying depths on the training data.
  2. Validate each tree on a separate validation dataset.
  3. Select the tree depth that gives the best performance on the validation data.

This approach ensures that the chosen parameters yield a tree that generalizes well to new data, as the validation set acts as a proxy for unseen real-world data.
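A minimal sketch of this procedure using scikit-learn's cross_val_score (synthetic data; the candidate depths, and the use of k-fold cross-validation in place of a single held-out validation set, are illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 3))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.5, size=400)

# Score each candidate depth with 5-fold cross-validation and keep the best.
scores = {}
for depth in [2, 3, 4, 5, 6, 8, 10, None]:
    model = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=5, random_state=0)
    scores[depth] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

best_depth = max(scores, key=scores.get)
print(scores)
print("best max_depth:", best_depth)
```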

2.5 Real-world Applications and Examples

Decision trees, especially regression trees, have found applications in a myriad of fields due to their versatility, simplicity, and interpretability. Here are some real-world examples of how regression trees are used:

Financial Forecasting:

  • Stock Market Prediction: Traders and financial analysts use regression trees to predict stock prices based on a variety of indicators like previous stock price, trading volume, economic news, etc. This helps in making informed decisions about buying, selling, or holding stocks.

  • Credit Scoring: Financial institutions employ regression trees to predict the probability of a loan applicant defaulting based on features such as income, employment history, credit history, and more. This helps in determining the creditworthiness of an individual and deciding the interest rate on loans.

Sales Prediction:

  • Retail Inventory Management: Retail chains use regression trees to predict future sales of products based on historical sales data, promotional activities, holidays, and other relevant factors. This aids in inventory management and ensures that products are stocked adequately to meet demand without resulting in overstock.

  • Real Estate Price Prediction: Realtors and property investors leverage regression trees to predict property prices based on features like location, square footage, age of the property, amenities, and proximity to schools or public transport. This aids buyers, sellers, and investors in making informed decisions.

Environmental Modeling:

  • Temperature Prediction: Scientists and meteorologists use regression trees to predict future temperatures based on historical data and several atmospheric variables. These predictions can help in planning agricultural activities, predicting power consumption, or preparing for extreme weather conditions.

  • Precipitation Forecast: Using factors like humidity levels, cloud cover, wind patterns, and past precipitation data, regression trees can predict the likelihood and amount of rain or snowfall in a particular area. Such predictions are invaluable for water resource management, agriculture, and disaster preparedness.


3. Classification Trees

Classification trees, as the name suggests, are used for classification tasks where the objective is to assign a discrete label to an instance based on its features.

3.1 Introduction to Classification Trees

Definition and Use Cases:

Definition: A classification tree is a decision tree where the target variable is categorical in nature. At each internal node of the tree, a decision is made based on the value of an input feature. This decision guides the data down a branch of the tree. Once it reaches a leaf node, a class label is assigned.

Use Cases:

  • Medical Diagnosis: Classification trees can be used to diagnose diseases based on symptoms. Each symptom can guide a decision until a potential diagnosis is reached.

  • Spam Email Detection: By analyzing the content and metadata of emails, classification trees can classify emails as "spam" or "not spam".

  • Customer Segmentation: Businesses use classification trees to segment customers into different groups based on their purchasing behavior, preferences, and other features. This can guide targeted marketing strategies.

  • Fraud Detection: Financial institutions use classification trees to detect potentially fraudulent activities by examining transaction details and patterns.

Differences between Classification Trees and Other Classification Methods:

  • Interpretability: Classification trees offer clear visual interpretations. Each decision can be traced back through the tree, allowing for a transparent understanding of how a particular decision was made. This is not always the case with methods like neural networks, which are often seen as "black boxes".

  • Non-Linearity: Unlike linear classifiers like logistic regression, classification trees can capture non-linear relationships without the need for feature engineering.

  • Feature Interactions: Trees naturally capture interactions between features. For example, if a disease is only diagnosed when two specific symptoms occur simultaneously, a tree can represent this interaction effortlessly.

  • Decision Boundary: Classification trees create rectangular decision boundaries in the feature space, unlike methods like SVM or logistic regression that can create linear or curved boundaries.

  • No Need for Scaling: Trees are not sensitive to the scale of data. Whether a feature ranges between 0-1 or 0-1000, it doesn’t affect the performance of the tree, which is different from methods like K-Nearest Neighbors or SVM where feature scaling is crucial.

3.2 Node Splitting Criteria for Classification Trees

When constructing a classification tree, it's essential to decide how to split the data at each node. This decision largely determines the quality of the tree. Several criteria can be used, each with its merits and considerations.

Gini Impurity:

Definition: Gini impurity measures the disorder of a set of items. It's calculated as:

$$G = 1 - \sum_{k=1}^{K} p_k^2$$

Where $p_k$ is the proportion of items labeled with class $k$ in the set.

Use: When evaluating a potential split, the weighted impurity of the two resulting nodes is compared to the impurity of the original node. The split that results in the largest impurity decrease is chosen.

Merits: It's computationally faster as it doesn't involve logarithm calculations like entropy.
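A small NumPy sketch of Gini impurity and the impurity decrease used to score a split (the labels below are toy values):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the weighted impurity of the children."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = np.array(["A", "A", "A", "B", "B", "B"])
print(round(gini(parent), 3))                                   # 0.5 for a 50/50 node
print(round(gini_decrease(parent, parent[:3], parent[3:]), 3))  # 0.5: a perfect split
```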

Information Gain (Entropy-based):

Definition: Entropy measures the uncertainty or randomness of a set of items. For classification, it's defined as:

$$H = -\sum_{k=1}^{K} p_k \log_2 p_k$$

Where $p_k$ is the proportion of items labeled with class $k$ in the set.

Use: Information gain is the reduction in entropy achieved because of the split. It's calculated as the difference between the entropy of the original set and the weighted entropy of the two resulting sets. The split with the highest information gain is selected.

Merits: Tends to produce balanced trees, especially when the classes are of approximately equal size.
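The entropy-based criterion can be sketched the same way (again with toy labels; splitting a balanced two-class node into two pure children yields an information gain of one bit):

```python
import numpy as np

def entropy(labels):
    """Entropy of a set of class labels: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

parent = np.array(["A", "A", "A", "B", "B", "B"])
print(round(entropy(parent), 3))                                   # 1.0 bit for a 50/50 node
print(round(information_gain(parent, parent[:3], parent[3:]), 3))  # 1.0: a perfect split
```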

Chi-square Test:

Definition: The Chi-square test measures the statistical significance of the difference between the observed and expected frequencies in one or more categories. It's used to test the independence of two categorical variables.

Use: For each potential split, the observed class distributions in the resulting nodes are compared to the expected distributions under the assumption that the split carries no information (i.e., that the split and the class label are independent). The split with the most statistically significant difference (highest chi-square value) is selected.

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Where $O_i$ is the observed frequency, and $E_i$ is the expected frequency.

Merits: It gives a statistical measure of the difference, making it robust against splits that might happen due to random chance in the data.
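A hedged sketch of the chi-square idea using SciPy's chi2_contingency on a made-up contingency table for one candidate split (rows are the child nodes, columns are the classes); a larger statistic and smaller p-value suggest the split separates the classes more than chance alone would:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table for a candidate split: rows = child nodes, columns = classes.
# Illustrative counts: left node has 30 A / 10 B, right node has 8 A / 32 B.
observed = np.array([[30, 10],
                     [ 8, 32]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square:", round(chi2, 2), " p-value:", round(p_value, 4))
print("expected counts under independence:\n", expected)
```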

3.3 Leaf Node Value Determination

In classification trees, once an instance reaches a leaf node, a class label needs to be assigned. The process of determining this label and dealing with potential class imbalances is critical to the tree's performance.

Majority Voting System:

Definition: The majority voting system is a simple and commonly used method to decide the class label of a leaf node. It assigns the most common class label among the training instances that reach that leaf.

Use: If a leaf node has, for instance, 7 samples of class A and 3 of class B, it will predict class A for any new instance that reaches this node.

Merits: Intuitive and straightforward. Typically provides a good baseline for classification tasks.
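A one-function sketch of majority voting for a leaf (toy labels matching the 7-versus-3 example above):

```python
import numpy as np

def leaf_label(labels):
    """Majority vote: return the most frequent class among the samples in the leaf."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

print(leaf_label(np.array(["A", "A", "A", "A", "A", "A", "A", "B", "B", "B"])))  # "A" (7 vs 3)
```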

Handling Class Imbalance in Decision Trees:

Class imbalance, where one class has significantly more instances than the other(s), can distort the tree's decisions. Trees might become biased towards the majority class, leading to poorer performance for minority classes. Here's how to handle it:

  1. Weighted Node Splits: Instead of using raw counts, use weighted counts to evaluate splits. For instance, if class A is the minority and is deemed more important, instances of class A might be given more weight than those of class B during node splits (see the sketch after this list).

  2. Cost-sensitive Learning: Assign different misclassification costs to different classes. The tree algorithm is then guided to minimize the total cost rather than the total error rate. This can steer the tree towards better classifying the minority class.

  3. Synthetic Minority Over-sampling Technique (SMOTE): It's a pre-processing step where synthetic samples are generated for the minority class, helping to balance the class distribution.

  4. Use of Balanced Accuracy: Instead of traditional accuracy, balanced accuracy, which is the average of recall obtained on each class, can be used as a metric to guide the tree-building process.

  5. Pruning: Overfitting can be more pronounced with class imbalances. Pruning the tree can help in reducing the bias towards the majority class by removing branches that might be overly influenced by outliers or noise in the majority class.
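The weighting ideas in items 1-2 and the balanced-accuracy metric in item 4 map onto scikit-learn's class_weight parameter and balanced_accuracy_score. The sketch below (synthetic imbalanced data and illustrative settings, not a definitive recipe) compares an unweighted tree with a class-weighted one:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced problem: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(max_depth=5, class_weight="balanced",
                                  random_state=0).fit(X_tr, y_tr)

for name, model in [("plain", plain), ("class_weight='balanced'", weighted)]:
    print(name, "balanced accuracy:",
          round(balanced_accuracy_score(y_te, model.predict(X_te)), 3))
```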

3.4 Handling Overfitting

Overfitting is a common concern with decision trees, especially deep ones. When a tree overfits, it captures noise in the training data, making it perform poorly on unseen data. Below are strategies to combat overfitting in classification trees.

Tree Pruning:

Definition: Tree pruning involves reducing the size of a tree by removing sections of the tree that provide little power in predicting target values.

Use: After a tree is built, subtrees are removed if removing them doesn't significantly impact the tree's performance on a validation set.

Merits: It simplifies the tree, making it easier to understand and interpret. Helps in reducing overfitting and often leads to an increase in the tree's predictive accuracy on unseen data.

Maximum Depth Specification:

Definition: This is a pre-pruning method where you set a limit on how deep the tree can grow.

Use: When building the tree, once the specified depth is reached, no new splits are made, regardless of the potential information gain.

Merits: A shallower tree is less likely to overfit. Forces the tree to capture the most significant patterns first.

Minimum Samples per Leaf:

Definition: It's a constraint on the minimum number of samples a node must have to become a leaf node.

Use: If, during the tree-building process, splitting a node would result in a leaf with fewer samples than the specified threshold, the split isn't made.

Merits: Prevents the tree from making splits that capture noise or outliers. Ensures that each decision in the tree is based on a sufficiently large sample of data.

Cross-Validation for Optimal Tree Depth:

Definition: Cross-validation involves splitting the training data into multiple subsets and training/testing the model on these subsets to get an unbiased estimate of model performance.

Use: Different tree depths are tried, and their performance is assessed using cross-validation.

Merits: Helps in finding the tree depth that generalizes best to unseen data. Reduces the risk of overfitting by preventing the tree from becoming too deep while still allowing it to capture significant patterns.
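A compact sketch of this search using scikit-learn's GridSearchCV (synthetic data; the parameter grid is an arbitrary illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Search over depth and leaf-size constraints with 5-fold cross-validation.
param_grid = {"max_depth": [2, 3, 4, 5, 7, 10, None],
              "min_samples_leaf": [1, 5, 10, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```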

3.5 Real-world Applications and Examples of Classification Trees

The intuitive nature of decision trees and their visual representation make them widely applicable across various domains. Here are some of the prominent real-world applications and examples of classification trees:

Medical Diagnosis:

Description: Medical practitioners often use decision trees to identify diseases or conditions based on symptoms, medical history, and diagnostic test results.

Example: A classification tree can be trained on patient data to determine the likelihood of a patient having heart disease based on features like age, cholesterol levels, chest pain type, and EKG results.

Benefits: Provides a clear decision-making process that doctors can follow. Assists in early diagnosis and treatment decisions.

Customer Segmentation:

Description: Companies use decision trees to categorize their customer base into distinct segments based on purchasing behavior, demographics, and other features.

Example: A retailer could segment its customers into groups like "frequent shoppers," "occasional shoppers," and "rare shoppers" based on their purchase frequency, average spend, and product preferences.

Benefits: Helps companies tailor marketing strategies for specific segments. Aids in optimizing resource allocation towards the most valuable customer groups.

Fraud Detection:

Description: Financial institutions use decision trees to detect suspicious transactions or behavior that might indicate fraudulent activity.

Example: A bank might use a classification tree to flag potentially fraudulent credit card transactions based on features like transaction amount, location, time of purchase, and merchant type.

Benefits: Efficient and automated detection of suspicious activities. Reduction in financial losses due to fraud.

Species Classification:

Description: Biologists and ecologists often employ decision trees to classify species based on physical characteristics, habitats, and behaviors.

Example: A classification tree might be used to categorize different bird species based on features like feather color, beak shape, song type, and nesting habits.

Benefits: Facilitates rapid identification and classification of species in the field. Aids in biodiversity studies and conservation efforts.

3.6 Handling Missing Values in Classification Trees

Missing data is a common challenge in the realm of data analytics and machine learning. When working with decision trees, especially classification trees, managing missing values becomes critical. The good news is, classification trees provide us with several strategies to deal with them.

Skipping splits over missing values:

Description: When the algorithm encounters a feature with a missing value during the tree construction, it can simply skip considering that feature for the split at that node.

Pros: Simple to implement and doesn't require any complex computation.

Cons: Could result in a suboptimal split if many values are missing. Loss of potential information from other non-missing values in that feature.

Surrogate splits:

Description: If the primary feature chosen for a split has missing values, surrogate splits use backup rules based on other features to make the split decision. The backup feature (or features) is chosen based on how well it mimics the split of the primary feature.

Pros: Ensures that all samples, including those with missing values, are used in the split. Often results in more accurate and reliable tree structures.

Cons: Increases the complexity of the tree-building process. May need multiple backup features, especially if missing values are prevalent.

Imputation methods:

Description: Imputation involves filling in the missing values with estimated values. Common methods include mean imputation (filling in with the mean value of the feature), median imputation, mode imputation, or using model-based techniques like k-Nearest Neighbors.

Pros: Creates a complete dataset, allowing the tree to utilize all features without skipping any. Can enhance the accuracy if imputation is done thoughtfully.

Cons: Imputed values might not always reflect the true underlying data distribution. Could introduce bias if the imputation method isn't chosen carefully.
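A minimal sketch of the imputation route using scikit-learn's SimpleImputer in front of a tree (the tiny feature matrix with np.nan entries is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy feature matrix with missing entries (np.nan) and binary labels.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan],
              [5.0, 6.0], [6.0, 5.0], [np.nan, 1.0]])
y = np.array([0, 0, 1, 1, 1, 0])

# Mean-impute each feature, then fit the tree on the completed matrix.
model = make_pipeline(SimpleImputer(strategy="mean"), DecisionTreeClassifier(random_state=0))
model.fit(X, y)
print(model.predict([[np.nan, 2.5]]))
```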


4. Q&A

1. Question: What is the primary difference between regression trees and classification trees?

Answer: Regression trees are used to predict a continuous output variable, while classification trees predict a categorical output.


2. Question: How is the best split decided in a regression tree?

Answer: The best split in a regression tree is often determined by maximizing the reduction in variance or minimizing the Mean Squared Error (MSE) or Mean Absolute Error (MAE) for the target values of the training samples that fall within the split.


3. Question: Name a common metric used to decide the best split in a classification tree.

Answer: Common metrics include Gini impurity, information gain (based on entropy), and the chi-square test.


4. Question: What is tree pruning, and why is it important?

Answer: Tree pruning involves removing branches from a fully grown tree to avoid overfitting. It's important because it helps to ensure that the model generalizes well to unseen data, rather than just fitting the training data too closely.


5. Question: In classification trees, what method is often used to determine the value of a leaf node?

Answer: The majority voting system is typically used where the leaf node is assigned the class that has the most samples (or votes) within that leaf.


6. Question: How do decision trees handle categorical input features?

Answer: Decision trees can directly handle categorical features by splitting on the categories themselves. For instance, if a categorical feature has three categories A, B, and C, possible splits might be A vs. B & C, or A & B vs. C.


7. Question: What is a surrogate split in the context of missing values in decision trees?

Answer: A surrogate split provides backup rules for deciding splits in the presence of missing values. If the primary feature chosen for a split has missing data, surrogate splits use alternative features to make the split decision, based on how well these alternatives mimic the primary feature's split.


8. Question: Why might one choose to limit the maximum depth of a decision tree?

Answer: Limiting the maximum depth helps in preventing the tree from becoming too complex and overfitting the training data. A shallower tree is also more interpretable and easier to understand.


9. Question: Can decision trees handle non-linear relationships between features and the target variable?

Answer: Yes, decision trees are inherently capable of modeling non-linear relationships without the need for feature transformation, making them powerful tools for a variety of tasks.


10. Question: How do decision trees handle class imbalance, and why is it a concern?

Answer: Decision trees can be sensitive to class imbalance, often biased towards the majority class. Techniques such as balanced bootstrapping, cost-sensitive learning, or using the "class weight" parameter in some implementations can help trees perform better on imbalanced datasets.


11. Question: What is a terminal node or leaf in a decision tree?

Answer: A terminal node, or leaf, is an end node in a decision tree where predictions are made. For classification trees, it represents the class label, and for regression trees, it represents a continuous value.


12. Question: How does a decision tree algorithm deal with continuous input features?

Answer: For continuous input features, the decision tree identifies optimal split points. For instance, if a feature is age, it might split as "age <= 30" and "age > 30".


13. Question: Why are decision trees considered non-parametric methods?

Answer: Decision trees are non-parametric because they make no assumptions about the functional form of the transformation between inputs and output, allowing them to adapt to any relationship in the training data.


14. Question: How can decision tree models be visualized?

Answer: Decision tree models can be visualized as flowcharts, where each internal node represents a decision on a feature, each branch represents an outcome of the decision, and each leaf node represents a prediction.


15. Question: What's the advantage of using the Gini impurity over information gain (entropy) for a classification tree?

Answer: Both metrics are often very similar in performance. However, Gini impurity doesn't require computing logarithms and can be faster to compute than entropy, especially when building deep trees or trees on large datasets.


16. Question: How does cross-validation help in determining the optimal depth of a decision tree?

Answer: By using cross-validation, different tree depths can be evaluated on separate subsets of the training data. This helps in identifying a depth that achieves good performance on unseen data, reducing the likelihood of overfitting.


17. Question: In what scenario might a tree with a "maximum depth of 1" be particularly useful?

Answer: A tree with a maximum depth of 1, also known as a "decision stump," can be useful as a weak learner in ensemble methods like AdaBoost.
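A short sketch of this pairing with scikit-learn's AdaBoostClassifier (synthetic data; note that older scikit-learn releases name the estimator argument base_estimator rather than estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

stump = DecisionTreeClassifier(max_depth=1, random_state=0)   # a single split: a "decision stump"
# In older scikit-learn versions the keyword below is base_estimator instead of estimator.
boosted = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)

print("stump alone:  ", round(cross_val_score(stump, X, y, cv=5).mean(), 3))
print("boosted stumps:", round(cross_val_score(boosted, X, y, cv=5).mean(), 3))
```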


18. Question: How do regression trees handle outliers in the target variable?

Answer: Regression trees can be sensitive to outliers since the splits are determined based on reducing variance (or other criteria like MSE). An outlier can heavily influence the average value in a leaf, potentially leading to suboptimal splits.


19. Question: What is the chi-square test used for in the context of decision trees?

Answer: In decision trees, the chi-square test is used to determine the statistical significance of the differences between observed and expected frequencies of the target variable. This can help in deciding if a split is genuinely capturing meaningful patterns.


20. Question: How can feature importance be determined from a decision tree?

Answer: Feature importance in a decision tree can be determined based on the number of times a feature is used to split the data, weighted by the reduction in impurity (for classification) or variance (for regression) it provides each time.
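A brief sketch using scikit-learn's impurity-based feature_importances_ attribute on the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Impurity-based importances: each feature's total (weighted) impurity reduction, normalized.
for name, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{name:20s} {importance:.3f}")
```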