Lecture 4. Model Building Part 1
Date: 2023-02-21
1. Data Preprocessing
Data preprocessing is the initial step in the data mining process, where data from various sources is cleaned, transformed, and integrated to produce a dataset suitable for analytical modeling.
Major Tasks
Data Cleaning:
- Deals with the identification and correction of errors and inconsistencies in data to improve its quality.
- It involves handling missing data, smoothing noisy data, and detecting and removing outliers.
Data Integration:
- Involves merging data from different sources and ensuring a consistent view.
- Addresses issues like redundancy and inconsistency.
Data Transformation:
- Process of converting data into a suitable format or structure for analysis.
- Normalization (scaling all numeric variables to a standard range) and standardization (shifting the distribution of each attribute to have a mean of zero and a standard deviation of one) are typical methods used; a short sketch follows this list.
Data Reduction:
- Reduces the data volume while producing the same or similar analytical results.
- Common methods include dimensionality reduction, binning, histograms, clustering, etc.
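To make the normalization and standardization mentioned under Data Transformation concrete, here is a minimal sketch using scikit-learn (the toy matrix and its column scales are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric matrix: two columns on very different scales (values are invented).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Normalization: rescale each column to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: shift each column to mean 0 and standard deviation 1.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```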
Missing Values
Missing data can lead to biased or incorrect results when not properly handled.
Types of Missing Values:
- MCAR (Missing Completely At Random): The missingness is independent of any variable, observed or unobserved, so the probability of a data point being missing is the same for all observations. For example, a survey taker skips some responses entirely at random, with no pattern or reason.
- MAR (Missing At Random): The missingness is not completely random but is related to other observed variables, not to the missing value itself. For example, younger respondents skip a survey question about retirement plans; the missingness is related to their age but not directly to their actual retirement plans.
- MNAR (Missing Not At Random): The missingness depends on the unobserved missing value itself. For example, people with high salaries choose not to disclose their income, making income more likely to be missing precisely because it is high.
Methods for Handling Missing Values:
- Listwise Deletion: Removes all data for an observation that has one or more missing values.
- Pairwise Deletion: Uses all available data on a per-analysis basis. "Pairwise" refers to the fact that each variable (or pair of variables) is analyzed separately, so an observation is excluded only from the analyses that involve its missing variables. This method is often used when computing correlation or covariance matrices.
- Single Imputation: Replaces each missing value with a single estimated value. Common methods include mean, median, and mode imputation.
- Maximum Likelihood: Assumes the data are MAR and estimates model parameters directly from the observed data using a likelihood-based approach.
- Multiple Imputation: Like single imputation, but creates multiple copies of the dataset, imputes values in each copy, and pools the results across all copies. (A short imputation sketch follows this list.)
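As a rough illustration of single imputation and a model-based, multiple-imputation-style approach, here is a sketch using scikit-learn's SimpleImputer and the experimental IterativeImputer (the toy age/income values are invented; a full multiple-imputation workflow would create and pool several imputed datasets, which is not shown):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the experimental API)
from sklearn.impute import IterativeImputer

# Toy age/income matrix with missing entries (values are invented).
X = np.array([[25.0, 50_000.0],
              [32.0, np.nan],
              [47.0, 81_000.0],
              [np.nan, 62_000.0]])

# Single imputation: replace each missing value with the column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Model-based imputation: each feature with missing values is predicted from
# the other features, in the spirit of MICE / multiple imputation.
X_model = IterativeImputer(random_state=0).fit_transform(X)

print(X_mean)
print(X_model)
```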
Summary:
| Method | MCAR | MAR | MNAR |
|---|---|---|---|
| Listwise Deletion | Unbiased; Large Standard Error | Biased; Large Standard Error | Biased; Large Standard Error |
| Pairwise Deletion | Unbiased; Inaccurate Standard Error | Biased; Inaccurate Standard Error | Biased; Inaccurate Standard Error |
| Single Imputation | Often biased; Inaccurate Standard Error | Often biased; Inaccurate Standard Error | Biased; Inaccurate Standard Error |
| Maximum Likelihood | Unbiased; Accurate Standard Error | Unbiased; Accurate Standard Error | Biased; Accurate Standard Error |
| Multiple Imputation | Unbiased; Accurate Standard Error | Unbiased; Accurate Standard Error | Biased; Accurate Standard Error |
Note:
- "Biased/Unbiased" refers to whether the estimates are centered around the true parameter value. If the estimates tend to be off in a particular direction from the true value, they are biased; if they are, on average, right on target, they are unbiased.
- "Standard Error" indicates the variability of the estimates. A large standard error means the estimates are widely spread out and may be unreliable. An "accurate" standard error means the method's reported standard error reflects the true sampling variability of the estimate, while an "inaccurate" one understates or overstates that variability.
2. Handling Outliers
Dealing with outliers is essential since they can distort statistical analyses and models. Here are several methods to handle outliers:
- Visualization: Before any statistical computations, always visualize the data. Tools like box plots, scatter plots, and histograms can help in identifying outliers.
- Z-Score: The z-score represents how many standard deviations a data point is from the mean. A common threshold is a z-score of 2.5 or 3; data points beyond this threshold can be considered outliers (see the sketch after this list).
- IQR (Interquartile Range): The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. Any data point that lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR can be considered an outlier.
- Trimming: This involves removing the outliers from the dataset. However, it can lead to loss of information.
- Log Transformation: If the dataset is positively skewed, applying a logarithm can compress the long tail and spread out data on the lower end.
- Domain Knowledge: Sometimes, outliers might be genuine extreme values, and not errors. For instance, in a dataset of salaries, the income of a CEO might be a genuine outlier. In such cases, one must make informed decisions based on domain knowledge.
- Robust Statistical Methods: Some statistical methods and algorithms, like the median or robust regression methods, are naturally resistant to outliers.
- Residual Analysis: For models, inspect the residuals (difference between observed and predicted values). Large residuals can indicate outliers or that the model isn't fitting well.
- Cluster Analysis: By grouping data into clusters, outliers can be identified as those data points that do not belong to any cluster.
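A minimal sketch of the z-score and IQR rules above, applied to a synthetic series with one injected outlier (the data and thresholds are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic data: 200 roughly normal values plus one injected outlier.
rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(50, 5, 200), 120.0))

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(z_outliers)
print(iqr_outliers)
```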
3. Handling Class Imbalance
- Resampling Techniques:
  - Oversampling: Increase the number of instances in the minority class by duplicating samples or generating synthetic samples using methods like SMOTE (Synthetic Minority Over-sampling Technique) or ADASYN (Adaptive Synthetic Sampling).
  - Undersampling: Reduce the number of instances in the majority class. This can be done randomly or using techniques like Tomek links or the neighborhood cleaning rule.
- Algorithm-level Approaches:
  - Cost-sensitive Training: Assign higher misclassification costs to the minority class (see the sketch after this list).
  - Anomaly Detection: Treat the minority class as an anomaly detection problem.
  - Ensemble Methods: Use ensemble methods like bagging and boosting with base classifiers that can handle imbalanced datasets.
- Using Different Evaluation Metrics:
  - Accuracy might not be a suitable metric for imbalanced datasets. Instead, consider using:
    - Precision
    - Recall
    - F1-score
    - Area under the Precision-Recall curve
    - Matthews correlation coefficient
- Data-level Approaches:
  - Creating Synthetic Samples: Apart from SMOTE, techniques like Borderline-SMOTE and SVM-SMOTE can be used to generate synthetic samples.
  - Creating Clusters: Divide the majority class into clusters and treat each cluster as a separate class. This turns the original imbalanced binary problem into a more balanced multi-class problem.
- Hybrid Methods:
  - Combine both oversampling and undersampling techniques. For example, the SMOTE + ENN (Edited Nearest Neighbors) method first oversamples the minority class and then cleans the data using undersampling.
- Change the Decision Threshold:
  - By default, many algorithms use 0.5 as the decision threshold. Adjusting this threshold can help trade off sensitivity and specificity (see the sketch after this list).
- Domain-specific Techniques:
  - In some areas, like fraud detection, domain-specific techniques and heuristics are developed to handle the imbalance.
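A minimal sketch combining two of the ideas above, cost-sensitive training and decision-threshold adjustment, on a synthetic imbalanced dataset with scikit-learn (the 0.3 threshold is purely illustrative; SMOTE-style resampling would typically use the separate imbalanced-learn package and is not shown):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (roughly 5% positives).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Cost-sensitive training: penalize minority-class errors more heavily.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Changing the decision threshold: lower it below 0.5 to trade precision for recall.
proba = clf.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.3).astype(int)  # 0.3 is an illustrative threshold

print(classification_report(y_te, y_pred))
```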
4. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an initial and essential step in the data analysis process, where the main features of the data are visualized and summarized with descriptive statistics. It helps analysts understand the patterns, relationships, anomalies, and structures in the data in order to formulate hypotheses and choose the right statistical tests.
Objectives of EDA:
- Understand the Data: Grasp the distribution, trends, and relationships within the data.
- Identify Outliers and Anomalies: Spot any data points that fall outside of what's expected.
- Test Assumptions: Validate assumptions related to the data for further statistical procedures.
- Inform Model Selection and Feature Engineering: Decide which models to use and if new features should be created.
Major Components:
- Descriptive Statistics:
  - Measures of Central Tendency: Mean, median, and mode.
  - Measures of Dispersion: Standard deviation, variance, range, and interquartile range.
  - Shape of the Distribution: Skewness and kurtosis.
- Data Visualization:
  - Univariate Analysis: Histograms, box plots, and density plots to visualize the distribution of individual variables.
  - Bivariate Analysis: Scatter plots, pair plots, and cross-tabulations to explore relationships and interactions between two variables.
  - Multivariate Analysis: Heat maps, parallel coordinate plots, and 3D scatter plots to visualize relationships among three or more variables.
- Handling Missing Values: Identifying and imputing missing data.
- Outlier Detection: Using techniques like Z-scores, IQR, or visualizations like box plots.
- Correlation Analysis: Understanding how different variables relate to one another (see the sketch after this list).
- Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) for data with a large number of features.
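A minimal EDA sketch in pandas covering descriptive statistics, distribution shape, correlation, and basic plots from the components above (the DataFrame and column names are invented; the plotting calls assume matplotlib is available):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented DataFrame; column names are made up for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, 200),
    "income": rng.normal(50_000, 12_000, 200),
    "purchases": rng.poisson(3, 200),
})

# Descriptive statistics: central tendency, dispersion, and quartiles per column.
print(df.describe())

# Shape of the distribution: skewness and kurtosis.
print(df.skew())
print(df.kurt())

# Correlation analysis: pairwise linear associations between numeric variables.
print(df.corr())

# Univariate and bivariate visualization: box plot and scatter plot.
df.boxplot(column="income")
df.plot.scatter(x="age", y="income")
plt.show()
```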
Best Practices:
- Iterative Process: EDA is best approached as an iterative process. As you explore, you may find new questions to ask or hypotheses to test.
- Stay Open-Minded: The purpose is to explore and ask questions, not to confirm pre-existing beliefs.
- Visualize Everything: Often, patterns in the data can be best understood visually.
- Document Your Findings: EDA can inform future analysis or modeling, so it's essential to record your observations and insights.
Challenges:
- Large Datasets: EDA can be resource-intensive with big data. Sampling or aggregation might be required.
- Messy Data: Data in the real world can be unstructured and messy. Cleaning the data can sometimes take more time than the actual analysis.
- Multiple Hypotheses: The risk of multiple comparisons can lead to spurious findings.
5. Feature Engineering
Feature engineering is the process of selecting, transforming, or creating new variables (features) in a dataset to enhance the performance and interpretability of machine learning models. It's an essential step because the right features can drastically improve model accuracy, while poor or irrelevant features can have a detrimental effect.
Why is Feature Engineering Important?
- Improved Model Performance: Models often perform better with input features that are tailored to enhance the underlying patterns in the data.
- Reduced Overfitting: By creating features that capture the essential aspects of the data and removing noisy or redundant ones, you can decrease the risk of overfitting.
- Enhanced Interpretability: Intuitive features make models more understandable, aiding in the explanation of model predictions.
Feature Engineering for Categories:
Ordinal Encoding
- Label Encoding: Assigns each unique category in a feature to an integer value from 0 to (N-1), where N is the number of distinct categories. This method does not consider any order of categories.
- Ordinal Encoding: Like label encoding, but the assignment of integers is based on some kind of order or hierarchy in the categories.
Nominal Encoding
- One-Hot Encoding: Creates a new binary column for each category in the feature. For N categories, it results in N new binary features, with a '1' indicating the presence of the category and '0' indicating the absence.
- Dummy Encoding: Similar to one-hot encoding but avoids the dummy variable trap by creating N-1 binary columns for N categories.
- Mean or Target Encoding: Each category is replaced with the mean value of the target variable for observations that have that category. Care should be taken to prevent data leakage.
- Frequency Encoding: Each category is replaced with the frequency or proportion of occurrences of that category in the entire feature.
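A minimal sketch of several of these encodings in pandas on an invented two-column frame (mean/target encoding is omitted because it needs a target column and care around leakage):

```python
import pandas as pd

# Invented categorical data.
df = pd.DataFrame({"size": ["S", "M", "L", "M"],
                   "color": ["red", "blue", "red", "green"]})

# Ordinal encoding: map ordered categories to integers by hand.
size_order = {"S": 0, "M": 1, "L": 2}
df["size_ord"] = df["size"].map(size_order)

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Dummy encoding: drop one level to avoid the dummy variable trap (N-1 columns).
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Frequency encoding: replace each category with its relative frequency.
df["color_freq"] = df["color"].map(df["color"].value_counts(normalize=True))

print(pd.concat([df, one_hot, dummies], axis=1))
```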
Feature Engineering for Numbers
- Quantization or Binning: Divide a continuous variable into several bins or intervals and assign a unique integer or label to each bin.
- Log Transformation: Apply the logarithm function to a feature to handle skewed data or to linearize exponential relationships.
- Power Transformation: A family of transformations (e.g., Box-Cox or Yeo-Johnson) that applies various exponents to a feature to achieve a more Gaussian or otherwise desired distribution.
- Feature Scaling or Normalization: Adjust the scale of features using methods like Min-Max scaling, Z-score normalization, or Robust scaling.
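A minimal sketch of binning, log and power transformations, and scaling on an invented skewed feature, using pandas and scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Invented, heavily right-skewed numeric feature.
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=500), name="amount")

# Quantization / binning: equal-width bins and quantile-based bins.
equal_width = pd.cut(x, bins=5, labels=False)
quantile_bins = pd.qcut(x, q=5, labels=False)

# Log transformation: compress the long right tail (log1p handles zeros safely).
x_log = np.log1p(x)

# Power transformation: Yeo-Johnson, fit to make the distribution more Gaussian.
x_power = PowerTransformer(method="yeo-johnson").fit_transform(x.to_frame())

# Min-Max scaling and z-score standardization, written out by hand.
x_minmax = (x - x.min()) / (x.max() - x.min())
x_zscore = (x - x.mean()) / x.std()
```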
Feature Engineering for Text
- Bag of Words: Represent text data by counting the number of times each word appears in a document. This results in a sparse matrix where rows are documents and columns are unique words.
- Bag of n-Grams: Similar to bag of words, but instead of single words, it uses sequences of n words. For example, the bigrams (n=2) for the sentence "I love you" are "I love" and "love you".
- Filtering:
  - Stop Words: Remove common words that do not add much meaning to the text, like 'and', 'the', etc.
  - Stemming: Reduce words to their root form by removing inflections. For example, "running" becomes "run".
- Tokenization: Break text into individual words or tokens.
- Part-of-Speech (POS) Tagging: Assign a POS tag to each word in the text, like noun, verb, adjective, etc.
- TF-IDF (Term Frequency-Inverse Document Frequency): A measure representing the importance of a term in a document relative to a collection of documents. It takes into account both local (term frequency) and global (inverse document frequency) information.
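A minimal sketch of bag of words, bag of n-grams, and TF-IDF with scikit-learn on three invented documents (stemming and POS tagging would need an NLP library such as NLTK or spaCy and are not shown):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three invented documents.
docs = [
    "I love data science",
    "I love machine learning",
    "data science and machine learning overlap",
]

# Bag of words: token counts per document, with English stop words removed.
bow = CountVectorizer(stop_words="english")
X_bow = bow.fit_transform(docs)

# Bag of n-grams: unigrams and bigrams together.
ngrams = CountVectorizer(ngram_range=(1, 2))
X_ngrams = ngrams.fit_transform(docs)

# TF-IDF: reweight counts by how rare each term is across the collection.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(bow.get_feature_names_out())
print(X_tfidf.toarray().round(2))
```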
Feature Engineering for Dates:
- Year, Month, Day Extraction: Extract the year, month, and day as separate features from a date. This can help in identifying monthly or yearly trends, or any seasonality present in the data.
- Day of the Week: Convert dates to the day of the week (e.g., Monday, Tuesday). This can be useful in scenarios like retail, where purchasing behavior might differ between weekdays and weekends.
- Weekend/Weekday Indicator: Create a binary indicator to specify if a date falls on a weekend or a weekday.
- Time Since: Calculate the duration (in days, months, or years) from a particular date until a significant event. For instance, the number of days since the last holiday.
- Day of the Year: Convert the date to its corresponding day number in the year (from 1 to 365/366).
- Quarter: Extract the quarter of the year in which the date falls (Q1, Q2, Q3, Q4).
- Is Holiday Indicator: A binary feature indicating if the date is a public holiday or not.
- Season Extraction: Depending on the region, categorize the date into seasons, like Spring, Summer, Fall, or Winter.
- Extract Hour, Minute, Second (for Timestamps): If the date feature includes time, extract hour, minute, and second as separate features. Useful in scenarios where time of the day matters, like web traffic analysis.
- Elapsed Time Since a Reference Point: Calculate the time (in appropriate units) since a fixed reference date. This continuous feature can be useful in models that require understanding the progression of time.
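A minimal sketch of several of these date features using the pandas .dt accessor on an invented timestamp column:

```python
import pandas as pd

# Invented timestamp column.
df = pd.DataFrame({"order_time": pd.to_datetime([
    "2023-01-15 09:30:00", "2023-07-04 18:05:00", "2023-12-25 12:00:00"])})

ts = df["order_time"]
df["year"] = ts.dt.year
df["month"] = ts.dt.month
df["day"] = ts.dt.day
df["day_of_week"] = ts.dt.dayofweek        # Monday=0 ... Sunday=6
df["is_weekend"] = ts.dt.dayofweek >= 5    # weekend/weekday indicator
df["day_of_year"] = ts.dt.dayofyear
df["quarter"] = ts.dt.quarter
df["hour"] = ts.dt.hour

# Elapsed time (in days) since a fixed reference date.
reference = pd.Timestamp("2023-01-01")
df["days_since_ref"] = (ts - reference).dt.days

print(df)
```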
Feature Selection vs. Feature Engineering:
- Feature Selection: This is about choosing the most relevant subset of features from the original set, often to reduce dimensionality or multicollinearity.
- Feature Engineering: This involves creating new features or transforming existing ones to better represent the underlying patterns in the data.
Challenges:
- Overengineering: Creating too many features can lead to overfitting, where a model performs well on training data but poorly on unseen data.
- Computational Complexity: Adding many new features can significantly increase the computation time for training models.
- Lack of Domain Knowledge: Without understanding the domain, one might miss essential features or create irrelevant ones.
Best Practices:
- Iterate: Feature engineering is an iterative process. Create features, test their impact, refine, and repeat.
- Collaborate: Work with domain experts to uncover potential insightful features.
- Validate: Always validate the effect of new features on model performance using out-of-sample data.
- Prioritize Simplicity: A small number of meaningful features often works better than a large number of constructed ones.
6. Data Splitting
Data splitting is the process of dividing a dataset into multiple subsets, commonly used in machine learning and statistics to assess the performance of algorithms. Proper data splitting ensures that a model doesn't just memorize the data (overfitting) but generalizes well to unseen data.
Why is Data Splitting Important?
- Assess Model Performance: By training on one subset and testing on another, we can estimate how the model will perform on unseen data.
- Prevent Overfitting: Overfitting occurs when a model learns the training data too well, including its noise and outliers. Splitting data allows us to detect if this is happening.
Common Methods:
- Train/Test Split:
  - The dataset is divided into two parts: a training set and a test set.
  - The model is trained on the training set and evaluated on the test set.
  - This method is simple but can lead to high variance in model performance if the split is not representative.
- K-Fold Cross-Validation:
  - The dataset is divided into 'k' subsets (or "folds").
  - The model is trained on 'k-1' folds and tested on the remaining fold.
  - This process is repeated 'k' times, with each fold serving as the test set once.
  - The model's performance is averaged over the 'k' trials.
  - Helps in reducing the variance that comes from a single random train/test split.
- Stratified Sampling:
  - Used when the data has imbalanced classes.
  - Ensures that each subset maintains the same proportions of classes as in the original dataset.
  - Especially crucial for datasets where one class significantly outnumbers the other(s).
- Time-Series Split:
  - For time-series data where chronological order matters.
  - Data is split chronologically. For example, if we have monthly data for five years, we might train on the first four years and test on the final year.
- Leave-One-Out Cross-Validation (LOOCV):
  - A variation of k-fold cross-validation where 'k' is equal to the number of data points.
  - Each data point is used as a test set exactly once.
  - Computationally expensive, but makes maximal use of the data and reduces the variance that comes from any single train/test split.
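A minimal sketch of a stratified train/test split, stratified k-fold cross-validation, and a time-series split with scikit-learn (the dataset and model are synthetic placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, TimeSeriesSplit,
                                     cross_val_score, train_test_split)

# Synthetic, mildly imbalanced classification problem as a placeholder.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Train/test split (80/20), stratified so both sets keep the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Stratified 5-fold cross-validation: average score over five held-out folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())

# Time-series split: each fold trains on earlier rows and tests on later ones.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(train_idx[0], train_idx[-1], "->", test_idx[0], test_idx[-1])
```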
Considerations:
- Size of Splits: The typical split ratio for train/test is 70/30 or 80/20, but it can vary based on the dataset's size and the specific problem.
- Data Leakage: Ensure that no data used in the testing phase is available during training. This includes not just the target variable but also features that would be unavailable at prediction time.
- Random vs. Deterministic Splits: While random splits are common, deterministic splits (e.g., always using the last month's data as a test set) can be useful, especially when there's a temporal or ordered aspect to the data.
Bootstrap
Bootstrap is a resampling technique used to estimate the distribution of a statistic (like the mean or variance) by creating multiple samples drawn with replacement from the original data. This technique allows us to simulate the randomness and estimate the variability in a statistic without making any strong parametric assumptions.
Steps for Bootstrapping:
- Randomly draw a sample from your dataset with replacement. This sample should be of the same size as your original dataset.
- Calculate the desired statistic (e.g., mean, median) on this resampled data.
- Repeat steps 1 and 2 a large number of times (e.g., 1000 or 10000 times) to create a distribution of the statistic.
- From this distribution, you can estimate the variance, confidence intervals, or any other statistical measures of the statistic of interest.
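A minimal sketch of the steps above using NumPy, bootstrapping the sample mean (the sample and the number of resamples are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=200)  # illustrative original sample

n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Step 1: draw a resample of the same size as the data, with replacement.
    resample = rng.choice(data, size=data.size, replace=True)
    # Step 2: compute the statistic of interest on the resample.
    boot_means[i] = resample.mean()

# Steps 3-4: the collection of resampled means approximates the sampling
# distribution; use it for a standard error and a 95% percentile interval.
se = boot_means.std(ddof=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(se, (ci_low, ci_high))
```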
When Bootstrap Fails:
- Small Sample Sizes: If the original sample size is too small, bootstrapping may not capture the true underlying population distribution effectively. Repeatedly sampling from a small dataset may just amplify existing biases.
- Non-IID Data: Bootstrap assumes data are independently and identically distributed (IID). If there's structure in the data, such as time series data, standard bootstrap resampling might break this structure, leading to incorrect inferences.
- Highly Skewed Distributions: In cases of extreme skewness, the bootstrap may not always perform well. The resampling might miss rare but influential points, or give them too much weight if they are sampled multiple times.
- Parametric Data with Known Distributions: If the underlying distribution of the data is known, then parametric methods might be more efficient and accurate than the bootstrap.
- Boundary Values: For some statistical measures, especially those that involve boundaries (like extreme quantiles), the bootstrap might not provide accurate estimates because resampling can miss extreme values.
7. Q&A
1. Question: What do you understand by the terms MCAR, MAR, and MNAR in the context of missing data?
Answer: MCAR stands for Missing Completely At Random, which means the missingness of data is not related to any observed or unobserved data. MAR, or Missing At Random, means the missingness is related to observed data but not the missing data itself. MNAR, Missing Not At Random, implies that the missingness is related to the missing data itself.
2. Question: What strategies do you employ to handle class imbalance?
Answer: Several strategies can be used, including:
- Resampling: This can be either oversampling the minority class or undersampling the majority class.
- Using synthetic data: Tools like SMOTE (Synthetic Minority Over-sampling Technique) can generate synthetic samples.
- Adjusting class weights: Giving more weight to the minority class during model training.
- Anomaly detection techniques: Treating the minority class as an anomaly.
- Using ensemble methods: Such as bagging and boosting to improve classification.
3. Question: How do you identify and treat outliers in your dataset?
Answer: Outliers can be identified using various methods, such as statistical measures (e.g., z-score, IQR), visual methods like box plots and scatter plots, or algorithms like DBSCAN. Treatment methods include:
- Capping: Replacing outliers with the maximum (or minimum) allowed value.
- Transformation: Log or square root transformations can reduce the impact of outliers.
- Removing: In some cases, it might be best to remove outliers altogether.
- Imputation: Replacing outliers with statistical measures like the mean, median, or mode.
4. Question: What are the main objectives of Exploratory Data Analysis (EDA)?
Answer: EDA aims to:
- Understand the data's main characteristics through visualization and statistical methods.
- Identify patterns, relationships, anomalies, or outliers.
- Formulate hypotheses and insights for further analysis.
- Prepare data for modeling by identifying necessary transformations.
5. Question: Why is feature engineering important and what are some common methods?
Answer: Feature engineering is crucial because it helps improve the performance of machine learning models by creating meaningful input features from the data. Common methods include:
- Binning: Grouping continuous variables into discrete bins.
- Polynomial features: Creating interaction terms.
- Encoding categorical variables: Like one-hot or label encoding.
- Normalization/Standardization: Scaling features to a similar range.
- Feature extraction: Using techniques like PCA (Principal Component Analysis) to derive new features.
6. Question: Why is it essential to split your data into training and test sets?
Answer: Splitting data allows us to assess how the model will perform on unseen data. Training the model on one dataset and testing on another ensures that the model doesn't merely memorize the data (overfitting) but generalizes well to new, unseen data.
7. Question: Describe the difference between K-Fold Cross-Validation and Stratified K-Fold Cross-Validation.
Answer: K-Fold Cross-Validation divides the dataset into 'k' subsets and uses 'k-1' subsets for training and the remaining one for testing, iteratively. Stratified K-Fold ensures that each fold maintains the same proportion of target class labels as the complete dataset, making it especially useful for imbalanced datasets.
8. Question: What do you understand by data leakage and how do you prevent it during preprocessing?
Answer: Data leakage refers to the situation where information from the test set unintentionally influences the training process. It can be prevented by:
- Ensuring that any data transformation or preprocessing is learned only from the training set.
- Carefully managing feature engineering to avoid including data that wouldn't be available during prediction.
- Using proper data splitting techniques.
9. Question: Describe a situation where removing outliers might not be a good idea.
Answer: In cases where outliers represent significant events or rare occurrences that are crucial for the study, such as fraud detection in finance or diagnosing rare diseases in medicine, removing outliers might lead to missing out on important patterns or insights.
10. Question: Why might you prefer time-series data splitting over a regular train-test split?
Answer: For time-series data, the order of data points matters due to temporal dependencies. Using a regular train-test split might disrupt these temporal patterns. Time-series splitting ensures that past data is used to predict future events, which aligns with real-world scenarios where we don't have access to future data during training.
11. Question: Explain the impact of missing data on a dataset and the potential biases it can introduce.
Answer: Missing data can distort the representativeness and reliability of results. Depending on the type of missingness (MCAR, MAR, MNAR), it can introduce biases that make the analyses non-generalizable. For instance, if a certain group tends to have more missing responses, analyses may be biased against them.
12. Question: When might you consider using a technique like SMOTE for handling class imbalance?
Answer: SMOTE, or Synthetic Minority Over-sampling Technique, might be considered when there's a severe class imbalance in the dataset. Instead of merely oversampling the minority class with duplicate records, SMOTE creates synthetic samples, leading to a more diverse and richer dataset and potentially improving classifier performance.
13. Question: How do you decide which features to include or exclude when building a model?
Answer: Feature selection is often based on domain knowledge, correlation analyses, feature importance scores, or algorithms like recursive feature elimination. Redundant or irrelevant features can be removed to improve model efficiency and prevent overfitting.
14. Question: In EDA, what do histograms, scatter plots, and boxplots help you visualize?
Answer: Histograms help visualize the distribution of a single continuous variable, showing the frequency of data in different intervals. Scatter plots visualize relationships or associations between two continuous variables. Boxplots give a five-number summary (minimum, first quartile, median, third quartile, maximum) of a continuous variable and can also help spot outliers.
15. Question: What are the key steps involved in feature engineering?
Answer: Key steps include:
- Understanding the domain and dataset.
- Creating interaction features.
- Encoding categorical variables.
- Normalizing or standardizing features.
- Handling missing values.
- Reducing dimensionality if needed (e.g., using PCA).
- Regularly testing and iterating on the features' performance in models.
16. Question: How do you validate that your data split between training and testing is representative of the overall data distribution?
Answer: One way is by using stratified sampling, especially when there's class imbalance. This ensures that the distribution of classes in both training and testing sets is similar to the overall dataset. Additionally, looking at descriptive statistics and visualizations can provide insights into the representativeness of the splits.
17. Question: How can you identify multicollinearity during data preprocessing, and why is it a concern?
Answer: Multicollinearity, where predictor variables in a model are correlated, can be identified using correlation matrices, scatter plots, or the Variance Inflation Factor (VIF). It's a concern because it can inflate variance, make model interpretation difficult, and lead to overfitting.
18. Question: What steps do you take during EDA to understand the nature of relationships between variables?
Answer: To understand relationships, one might:
- Use correlation matrices to see linear associations between continuous variables.
- Create scatter plots to visualize potential relationships and trends.
- Use crosstabs or chi-square tests for categorical variables.
- Leverage pair plots or heatmaps for a comprehensive view.
19. Question: Explain the difference between feature selection and feature extraction.
Answer: Feature selection involves picking a subset of the original features based on their relevance to the target variable, whereas feature extraction creates a new set of features by transforming or combining the original features, as seen in methods like PCA.
20. Question: If given a dataset with a mix of continuous and categorical variables, how would you preprocess it before feeding it into a machine learning model?
Answer:
- Continuous variables: Check for and handle missing values and outliers, and possibly standardize or normalize them.
- Categorical variables: Handle missing values and use encoding techniques like one-hot or label encoding to convert them to numerical format.
- Also, ensure that the train-test split is representative, especially if there's class imbalance.