Lecture 1. Introduction to Machine Learning
Date: 2023-01-31
1. Artificial Intelligence vs. Machine Learning vs. Data Science
Artificial Intelligence (AI)
- Definition: AI is a broad field of computer science focused on creating smart machines capable of performing tasks that typically require human intelligence.
- Key Points:
- Goal: Mimic human intelligence in machines.
- Types: Narrow AI (specific tasks) vs. General AI (broad understanding).
- Applications: Robotics, natural language processing, expert systems, etc.
Machine Learning (ML)
- Definition: ML is a subset of AI that allows computers to learn and improve from experience without being explicitly programmed for the task.
- Key Points:
- Learning: Through patterns and inference.
- Types: Supervised learning, unsupervised learning, reinforcement learning, etc.
- Applications: Recommendation systems, image recognition, fraud detection, etc.
Data Science
- Definition: Data science involves using automated methods to analyze massive amounts of data and extract knowledge from them.
- Key Points:
- Components: Data preparation, analysis, visualization.
- Tools: Python, R, SQL, Jupyter, pandas, etc.
- Applications: Business intelligence, predictive modeling, data mining, etc.
Comparison:
- AI is the overarching concept of machines being able to carry out smart tasks.
- ML is a method by which we achieve AI. It's the actual learning where a machine, by exposure to data, improves its performance in a task.
- Data Science encompasses a variety of techniques for handling, analyzing, and visualizing data. ML can be a tool in this process, but not every piece of data science requires ML.
2. Types of ML
Unsupervised Learning
Clustering
- Definition: Grouping similar instances together into clusters.
- Algorithms: K-Means, DBSCAN, Hierarchical Clustering, etc.
- Examples:
- Customer Segmentation: Retailers grouping customers based on their purchasing behaviors to tailor marketing strategies.
- Targeted Marketing: Grouping customers based on their interests to recommend relevant products.
- Recommendation Systems: Grouping users with similar movie preferences to recommend movies.
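As an illustrative sketch of customer segmentation (assuming scikit-learn is installed; the spending/visit features and the choice of 3 clusters are invented for the example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy customer data: [annual_spend, visits_per_month] (made-up values)
X = np.array([[200, 2], [250, 3], [1200, 10], [1300, 12], [700, 6], [650, 5]])

# Scale the features so neither dominates the distance computation
X_scaled = StandardScaler().fit_transform(X)

# Group customers into 3 segments (k chosen arbitrarily for illustration)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)  # cluster assignment for each customer
```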
Dimensionality Reduction
- Definition: Simplifying the input data by reducing the number of variables or features while retaining the essential information.
- Algorithms: PCA, t-SNE, LDA, etc.
- Examples:
- Meaningful Compression: Using algorithms like PCA to reduce the size of images for faster web loading without losing much quality.
- Big Data Visualization: Using t-SNE to visualize high-dimensional data in 2D or 3D.
- Structure Discovery: Using algorithms like Apriori to find hidden patterns in data.
- Feature Extraction: Using algorithms like LDA to extract the most important features from a dataset.
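A minimal dimensionality-reduction sketch with PCA from scikit-learn; the digits dataset is just a convenient built-in example of high-dimensional data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)    # 64-dimensional pixel features
pca = PCA(n_components=2)              # keep the 2 directions of largest variance
X_2d = pca.fit_transform(X)

print(X_2d.shape)                      # (1797, 2) -- ready for a 2D scatter plot
print(pca.explained_variance_ratio_)   # fraction of variance each component retains
```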
Supervised Learning
Classification
- Definition: Predicting a categorical class label for new instances.
- Algorithms: Logistic Regression, Decision Trees, SVM, Naive Bayes, KNN, Neural Networks, etc.
- Examples:
- Email Spam Detection: Categorizing emails as spam or not spam based on their content.
- Image Classification: Categorizing images based on their content.
- Diagnostic Systems: Categorizing patients into different groups based on their symptoms.
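A minimal classification sketch (assuming scikit-learn; the built-in breast-cancer dataset stands in for any labeled dataset, such as spam vs. not spam):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, then fit a classifier that predicts a categorical label (0 or 1)
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
print(clf.score(scaler.transform(X_test), y_test))  # fraction of test labels predicted correctly
```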
Regression
- Definition: Predicting a continuous value for new instances.
- Algorithms: Linear Regression, Decision Trees, SVM, Random Forests, Neural Networks, etc.
- Examples:
- Market Forecasting: Estimating the price of a house based on features like size, location, and number of bedrooms.
- Weather Forecasting: Estimating the temperature based on features like humidity, precipitation, and wind speed.
- Estimating Sales: Estimating the number of customers that will purchase a product based on advertising expenditure.
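A short regression sketch: fitting a linear model to predict a continuous value; the house features and prices below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: [size_sqft, bedrooms] -> price (a continuous target)
X = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4]])
y = np.array([200_000, 270_000, 330_000, 410_000])

reg = LinearRegression().fit(X, y)
print(reg.predict(np.array([[1800, 3]])))  # estimated price for an unseen house
```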
Reinforcement Learning
- Definition: An agent learns how to behave in an environment by performing actions and receiving rewards or penalties in return.
- Algorithms: Q-Learning, Temporal Difference, SARSA, etc.
- Examples:
- Game Playing: Algorithms that learn to play (and often excel at) games, like AlphaGo mastering the board game Go. Another example is real-time video games where the AI adapts to the player's strategy.
- Robot Navigation: Robots learning to navigate through a maze or an environment by trying different paths and receiving positive feedback when they reach the destination.
- Real-Time Decision Making: Algorithms that learn to optimize the performance of a data center by taking actions like adjusting the cooling systems or allocating servers to different tasks.
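A toy Q-learning sketch: an agent in a 5-cell corridor learns to walk right to reach a reward; the environment, rewards, and hyperparameters are all invented for illustration:

```python
import numpy as np

n_states, n_actions = 5, 2           # corridor of 5 cells; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))  # table of learned action values
alpha, gamma, epsilon = 0.1, 0.9, 0.2

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != n_states - 1:                       # episode ends at the goal cell
        explore = rng.random() < epsilon or Q[state, 0] == Q[state, 1]
        action = int(rng.integers(n_actions)) if explore else int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: move the estimate toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.argmax(Q[:-1], axis=1))  # learned policy for non-goal cells: should be all 1s (move right)
```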
3. Performance Evaluation
Evaluation Metrics
Accuracy
- Definition: Accuracy is the fraction of predictions the model got right. It is computed as: $\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$
- Example: If out of 100 predictions, 90 are correct, then accuracy is 90%.
Precision
- Definition: Precision is the fraction of predicted positives that are actually positive, i.e., true positives divided by all positive predictions.
- Example: If a model predicted 50 apples and 45 of them are actually apples, then precision is 90%.
Recall
- Definition: Recall is the fraction of actual positives that the model correctly identified, i.e., true positives divided by all actual positives.
- Example: If there were 50 actual apples and the model correctly identified 45, the recall is 90%.
F1 Score
- Definition: The F1 Score is the harmonic mean of precision and recall, providing a balance between them.
- Example: If precision is 90% and recall is 90%, the F1 score is also 90%.
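These four metrics can be computed with scikit-learn; the label vectors below are made up (1 = apple, 0 = not apple) to mirror the examples above:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]  # actual labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]  # model predictions

print(accuracy_score(y_true, y_pred))    # fraction of all predictions that are correct
print(precision_score(y_true, y_pred))   # of the predicted apples, how many are real apples
print(recall_score(y_true, y_pred))      # of the real apples, how many were found
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```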
ROC Curve
- Definition: The Receiver Operating Characteristic (ROC) curve visualizes the performance of a binary classifier across classification thresholds. It plots the true positive rate against the false positive rate.
- Example: In medical testing, it is used to see how the choice of threshold trades off between correctly detecting sick patients and falsely flagging healthy ones.
Mean Absolute Error (MAE)
- Definition: The average of the absolute differences between predictions and actual values: $\text{MAE} = \frac{1}{m}\sum_{i=1}^{m} |y_i - \hat{y}_i|$
- Pros:
  1. Interpretability: The result is intuitive and easy to understand, since it is a direct average of the error magnitudes, in the same units as the target.
  2. Equal Weighting: Errors contribute in proportion to their size, regardless of direction; large errors receive no extra weight.
- Cons:
  1. Lack of Sensitivity: It may not adequately reflect performance when large errors are much more costly than small ones, since no extra penalty is applied to them.
Mean Squared Error (MSE)
- Definition: The average of the squared differences between predictions and actual values: $\text{MSE} = \frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2$
- Pros:
  1. Penalizes Larger Errors: Squaring gives more weight to large errors, which is useful when large errors are particularly undesirable.
  2. Differentiability: It is smooth and continuously differentiable, which makes it well suited to optimization by gradient descent.
- Cons:
  1. Less Interpretability: The result is in squared units of the target, so it is less intuitive than MAE.
  2. Sensitive to Outliers: Squaring can let outliers or a few large errors dominate the total.
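Both error metrics are available in scikit-learn (they could equally be computed with a couple of lines of NumPy); the values below are illustrative:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]   # actual values
y_pred = [2.5, 5.0, 4.0, 8.0]   # predictions

print(mean_absolute_error(y_true, y_pred))  # average |error|, in the original units
print(mean_squared_error(y_true, y_pred))   # average squared error, penalizes large misses more
```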
Types of Errors
- Type I Error (False Positive): Predicting an event when there isn't one.
- Type II Error (False Negative): Not predicting an event when there is one.
Confusion Matrix
- Definition: A table used to describe the performance of a classification model on a set of data for which the true values are known. It includes terms:
- True Positives (TP): Actual positives correctly predicted as positives.
- True Negatives (TN): Actual negatives correctly predicted as negatives.
- False Positives (FP): Actual negatives incorrectly predicted as positives.
- False Negatives (FN): Actual positives incorrectly predicted as negatives.
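A sketch of how the four counts are read off with scikit-learn (the labels are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```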
4. Bias-Variance Tradeoff
Introduction
The Bias-Variance Tradeoff is a fundamental concept in machine learning that explains the balancing act between two primary sources of error in models. Every model's prediction error can be broken down into three components: bias, variance, and irreducible error.
$\text{Expected Prediction Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$
where:
- Bias: Refers to the error introduced by approximating the real-world problem, which may be complex, by a too-simple model.
- Variance: Refers to the error introduced by the model's complexity in trying to fit the training data.
- Irreducible Error: The noise inherent in any real-world data. It's the error that can't be reduced regardless of the algorithm used.
Bias
High bias can cause the model to miss relevant relations between features and target outputs (underfitting). This means the model is too simple to capture patterns in the data.
Example: Assuming a linear relationship in data whose true relationship is more complex.
Variance
High variance means the model is too complex and captures the noise in the training data, making it sensitive to fluctuations (overfitting).
Example: A high-degree polynomial fit in regression might capture all data points in the training set but perform poorly on unseen data.
Irreducible Error
This error stems from factors that can't be controlled, such as unknown variables or inherent noise in the data source. No matter how good the model is, this error can't be eliminated.
Model Capacity vs. Complexity
- Capacity: Refers to a model's ability to fit various functions. A model with higher capacity can fit more complex functions.
- Complexity: Refers to how intricate a model's hypothesis is, typically tied to its parameters.
The relationship: Increasing a model's capacity often increases its complexity, making it more prone to overfitting.
Solutions
- Cross-Validation: Partition data and train the model on different subsets to gauge its generalization ability.
- Regularization: Add penalty terms to the loss function to discourage overly complex models. For example, L1 (Lasso) and L2 (Ridge) regularization.
- Ensemble Methods: Combine multiple models to average out their predictions, reducing variance. Techniques include bagging, boosting, and stacking.
- Increasing Training Data: More data can provide a clearer signal, reducing both bias and variance.
- Feature Selection: Remove irrelevant or redundant features to simplify the model.
- Early Stopping: In iterative methods, stop training once validation performance starts to degrade.
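A sketch combining two of these remedies, k-fold cross-validation and L2 (Ridge) regularization, using a built-in scikit-learn dataset; the alpha value is arbitrary:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Ridge adds an L2 penalty (strength controlled by alpha) to discourage large coefficients
model = Ridge(alpha=1.0)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold, rotate, average
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```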
5. Parametric vs. Non-parametric Methods
Introduction
In machine learning and statistics, algorithms are often categorized into parametric and non-parametric methods based on the assumptions made about the underlying distribution of the data and the model's structure.
Parametric Methods
Definition: Parametric methods make strong assumptions about the form of the mapping function between inputs and outputs. They simplify the problem by assuming a form for this function.
Pros:
1. Simplicity: They are often simpler and faster to learn from data.
2. Requires Less Data: Due to their assumptions, less data is typically needed to train them effectively.
3. Interpretable: Many parametric models, like linear regression, are easy to interpret and understand.
Cons:
1. Constrained: Their assumptions can limit their flexibility, potentially leading to poorer performance if the assumptions are incorrect.
2. Underfitting: They might oversimplify the problem, leading to underfitting.
Examples: Linear Regression, Logistic Regression, Linear SVM, Perceptron
Non-parametric Methods
Definition: Non-parametric methods do not make strong assumptions about the form of the mapping function, allowing for more flexibility at the cost of requiring more data.
Pros:
1. Flexibility: Can fit a wide range of shapes or distributions.
2. Performance: Often yield better performance since they can adapt to the structure of the data.
3. Fewer Assumptions: No assumptions (or fewer assumptions) about the functional form of the data.
Cons:
1. Requires More Data: Typically, they need a lot of data to train effectively.
2. Computationally Intensive: Many non-parametric methods can be slower and require more memory.
3. Overfitting: Given their flexibility, they can overfit the data if not appropriately managed.
Examples: k-Nearest Neighbors, Decision Trees, Random Forests, Kernel SVM
6. K-Nearest Neighbors (KNN)
Introduction
K-Nearest Neighbors (KNN) is a simple, yet powerful non-parametric machine learning algorithm used primarily for classification, but also regression. It's based on the intuitive idea that similar data points should have similar labels. When predicting the class of a new data point, KNN looks at the 'k' training examples that are nearest to the point and returns the most common output value among them.
Algorithm
Algorithm: K-Nearest Neighbors (KNN)
Input: Training dataset, New data point, Number of neighbors 'k'
Output: Predicted class (or value for regression)
1. For each data point in the training dataset:
- Compute the distance between the new data point and the current training data point
2. Sort the distances in increasing order and pick the first 'k' points
3. For classification:
- Return the mode (most frequent class) among these 'k' points
For regression:
- Return the average of the output values of these 'k' points
Distance (Euclidean, for two points $p = (p_1, p_2)$ and $q = (q_1, q_2)$): $d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$
(Note: This is the formula for Euclidean distance in a 2-dimensional space, one of the most common distance metrics used in KNN. However, other distance metrics can be used depending on the application.)
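A from-scratch sketch of the algorithm above for classification, using Euclidean distance and a majority vote (NumPy only; the toy dataset is invented):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distance from the new point to every training point (Euclidean)
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among their labels (for regression, return the mean instead)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative dataset: two features, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))  # -> 1
```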
Bias-Variance Tradeoff
With KNN, the bias-variance tradeoff is directly influenced by the choice of 'k':
- Low k (e.g., k=1):
- High Variance / Low Bias: The predictions can be very sensitive to noise in the training data. A single mislabeled example can cause errors. It tends to overfit.
- Explanation: When k=1, the prediction for a new point is based solely on the nearest training example, without considering the broader structure of the data.
- High k (e.g., k equal to the number of training examples):
- Low Variance / High Bias: The algorithm becomes more resistant to outliers in the training data. However, the decision boundary might become overly smooth and potentially miss nuances in the data. It might underfit.
- Explanation: As k increases, the prediction becomes the average of more and more points, and the predictions tend toward the dominant class, potentially ignoring smaller patterns.
A common practice is to try various values of 'k' (often odd numbers to avoid ties) and use cross-validation to choose the 'k' that results in the best performance on a validation set.
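One way this is done in practice, sketched with scikit-learn's cross-validation over a small grid of odd k values (the iris dataset is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate odd values of k with 5-fold cross-validation and keep the best
results = {}
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    results[k] = scores.mean()

print(results)
print("best k:", max(results, key=results.get))
```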
7. Matrix Algebra
Least Squares Regression
When we fit a linear regression model, we aim to find the coefficients that minimize the sum of squared residuals. Let's derive this using matrix notation:
Given:
- $X$: the design matrix (m x n)
- $y$: the response vector (m x 1)
- $\beta$: the coefficient vector (n x 1)
The prediction: $\hat{y} = X\beta$
The residuals: $e = y - X\beta$
The sum of squared residuals: $\text{RSS}(\beta) = (y - X\beta)^\top (y - X\beta)$
To minimize this, differentiate w.r.t. $\beta$ and set to zero: $\frac{\partial \text{RSS}}{\partial \beta} = -2 X^\top (y - X\beta) = 0$
This gives the normal equations: $X^\top X \beta = X^\top y$
Derivation for $\hat{\beta}$: $\hat{\beta} = (X^\top X)^{-1} X^\top y$
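The closed-form solution translates directly into NumPy; the data below is synthetic, with a column of ones added for the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 10, size=m)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=m)  # true intercept 2, slope 3, plus noise

X = np.column_stack([np.ones(m), x])               # design matrix with an intercept column

# Normal equations: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [2, 3]
```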
Pseudo Inverse
For matrices that are not invertible (or poorly conditioned), the Moore-Penrose pseudo-inverse can be used to solve least squares problems. Given a matrix $X$, its pseudo-inverse is written $X^{+}$ (computable from the SVD, described below, as $X^{+} = V \Sigma^{+} U^\top$).
This allows us to compute a solution for the system $X\beta = y$ using:
$\hat{\beta} = X^{+} y$
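NumPy exposes the Moore-Penrose pseudo-inverse directly as np.linalg.pinv, which yields a least-squares solution even when the columns of X are (nearly) linearly dependent; the matrix below is contrived to show that case:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.1]])     # nearly collinear columns, so X^T X is ill-conditioned
y = np.array([1.0, 2.0, 3.0])

beta_hat = np.linalg.pinv(X) @ y  # pseudo-inverse solution to X beta ≈ y
print(beta_hat)
print(X @ beta_hat)               # fitted values, close to y
```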
Eigendecomposition
Given a square matrix $A$, its eigendecomposition represents it in terms of its eigenvalues and eigenvectors. Let $\lambda$ be an eigenvalue and $v$ its corresponding eigenvector; then:
$A v = \lambda v$
The matrix can be represented as $A = Q \Lambda Q^{-1}$, where:
- $Q$ is the matrix whose columns are the eigenvectors of $A$
- $\Lambda$ is a diagonal matrix with the eigenvalues on the diagonal.
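A quick numerical check of these definitions with NumPy on a small symmetric matrix (chosen arbitrarily):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigenvalues, Q = np.linalg.eig(A)   # columns of Q are the eigenvectors
Lambda = np.diag(eigenvalues)

print(np.allclose(A @ Q[:, 0], eigenvalues[0] * Q[:, 0]))  # A v = lambda v
print(np.allclose(A, Q @ Lambda @ np.linalg.inv(Q)))       # A = Q Lambda Q^{-1}
```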
Singular Value Decomposition (SVD)
Any matrix $A$ of shape (m x n) can be decomposed as $A = U \Sigma V^\top$, where:
- $U$ (m x m) contains the left singular vectors (eigenvectors of $A A^\top$)
- $V$ (n x n) contains the right singular vectors (eigenvectors of $A^\top A$)
- $\Sigma$ is an (m x n) diagonal matrix containing the singular values, which are the square roots of the non-zero eigenvalues of $A A^\top$ or $A^\top A$.
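The same decomposition computed with NumPy; note that np.linalg.svd returns the singular values as a vector, so Sigma is rebuilt as an (m x n) matrix to verify the reconstruction:

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])     # a 2 x 3 example matrix

U, s, Vt = np.linalg.svd(A)          # s holds the singular values
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)           # place the singular values on the diagonal

print(np.allclose(A, U @ Sigma @ Vt))  # True: A = U Sigma V^T
print(s**2)                            # equal to the non-zero eigenvalues of A A^T
```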