Homework 2: Linear Regression: Forest Fires
Introduction
In this assignment, we'll use linear regression to predict the area of damage from forest fires in the northeast region of Portugal. We are using the dataset from the UCI Machine Learning Repository. The dataset comprises 517 instances with 13 attributes. These attributes are as follows:
- X - x-axis spatial coordinate within the Montesinho park map: ranging from 1 to 9.
- Y - y-axis spatial coordinate within the Montesinho park map: ranging from 2 to 9.
- month - the month of the year: ranging from "jan" to "dec".
- day - the day of the week: from "mon" to "sun".
- FFMC - Fine Fuel Moisture Code index from the FWI system: 18.7 to 96.20.
- DMC - Duff Moisture Code index from the FWI system: 1.1 to 291.3.
- DC - Drought Code index from the FWI system: 7.9 to 860.6.
- ISI - Initial Spread Index from the FWI system: 0.0 to 56.10.
- temp - temperature in Celsius degrees: 2.2 to 33.30.
- RH - relative humidity in %: ranging from 15.0 to 100.
- wind - wind speed in km/h: 0.40 to 9.40.
- rain - external rain in mm/m2: 0.0 to 6.4.
- area - the burned area of the forest (in ha): 0.00 to 1090.84. Note: this output variable is highly skewed towards 0.0; modeling with the logarithm transform might be beneficial.
Data Preprocessing
Loading the Data
We start by loading the dataset:
import pandas as pd
forestfires = pd.read_csv('forestfires.csv')
Log Transformation of Area
To address the skewness towards 0.0 in the 'area' column, we'll apply a log transformation:
import numpy as np
forestfires['area'] = np.log10(forestfires['area'] + 1)
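Because the target is now stored as log10(area + 1), any prediction made later is on that scale too. The following sketch shows the transform and its inverse on a few made-up area values (these numbers are illustrative assumptions, not rows from the dataset):

```python
import numpy as np

# Hypothetical burned areas in hectares (illustrative, not from the dataset).
area_ha = np.array([0.0, 0.5, 9.0, 99.0])

# Forward transform used above: log10(area + 1).
area_log = np.log10(area_ha + 1)

# Inverse transform: 10**y - 1 maps log-scale values back to hectares.
area_back = 10 ** area_log - 1

print(area_log)
print(area_back)
```

The same inverse can be applied to model predictions when results need to be reported in hectares.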
Data Visualization
To visualize relationships among variables, we use seaborn.pairplot:
import seaborn as sns
sns.pairplot(forestfires)
The resulting visualization contains an array of scatter plots, which reveal several insights:
- The area is notably skewed towards 0.0.
- There's a positive correlation between the area and attributes like temperature (temp), wind speed (wind), FFMC, and DC.
- Presence of rain almost always indicates no fire.
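Visual impressions like these can be cross-checked numerically with a correlation table. The sketch below runs on a small synthetic frame (the `demo` data and its coefficients are assumptions made up for illustration); on the real data the same call chain could be applied to `forestfires`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "temp": rng.uniform(2, 33, n),
    "wind": rng.uniform(0.4, 9.4, n),
    "month": rng.choice(["jul", "aug", "sep"], n),
})
demo["area"] = 0.02 * demo["temp"] + rng.normal(0, 0.5, n)

# Pearson correlation of every numeric column with the target.
corr_with_area = (
    demo.select_dtypes(include="number")
    .corr()["area"]
    .drop("area")
    .sort_values(ascending=False)
)
print(corr_with_area)
```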
Feature Selection and Data Split
We'll select certain features for our regression and split the data into training and test sets:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
X = forestfires[['rain','wind']]
y = forestfires['area']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)
Applying Linear Regression
We use the Linear Regression model for training:
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
Model Evaluation
To evaluate our model's performance, we'll compute the Mean Squared Error (MSE):
print(mean_squared_error(y_test, y_pred))
The obtained MSE is approximately 0.39, measured on the log10(area + 1) scale rather than in hectares. It's worth noting that introducing additional features didn't improve the MSE in this instance.
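An MSE value is easier to interpret next to a trivial baseline. The sketch below compares a linear model with a predictor that always outputs the training mean; it uses synthetic data (an assumption for illustration), so its numbers say nothing about the forest fire results:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, skewed target that is unrelated to the features (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = rng.exponential(0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

baseline_mse = mean_squared_error(y_te, baseline.predict(X_te))
model_mse = mean_squared_error(y_te, model.predict(X_te))
print(f"baseline MSE: {baseline_mse:.3f}, linear model MSE: {model_mse:.3f}")
```

If the linear model does not clearly beat the baseline, the chosen features carry little linear signal about the target.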
Incorporating Polynomial Features
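Next, we'll test the performance with polynomial features. Because interaction_only=True is set and the degree is left at its default of 2, the only feature added here is the product of rain and wind: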
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
polyreg = make_pipeline(PolynomialFeatures(interaction_only=True, include_bias=False), LinearRegression())
polyreg.fit(X_train, y_train)
y_pred = polyreg.predict(X_test)
print(mean_squared_error(y_test, y_pred))
After incorporating polynomial features, the MSE increased to around 0.49. Once again, the added features did not reduce the test error.
Discussion
Here are some suggestions that could improve the model's predictive performance on this forest fire dataset:
Feature Engineering:
- Temporal Features: Consider deriving new features from the 'month' and 'day' attributes. For instance, you could transform the 'month' feature into a 'season' feature (e.g., Spring, Summer, Fall, Winter). This might better capture the cyclical nature of forest fires.
- Interaction Terms: Sometimes, the interaction between two features can be more informative than the features themselves. For example, the interaction between 'temp' and 'RH' (relative humidity) might be a powerful predictor.
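Both ideas can be sketched in a few lines of pandas. The month-to-season mapping below assumes Northern Hemisphere meteorological seasons, and the column name `temp_x_RH` is a hypothetical choice; the rows are made up for illustration:

```python
import pandas as pd

# Made-up rows using the dataset's column names.
demo = pd.DataFrame({
    "month": ["jan", "mar", "aug", "oct"],
    "temp": [5.0, 12.0, 28.0, 15.0],
    "RH": [80, 60, 25, 55],
})

# Assumed mapping: Northern Hemisphere meteorological seasons.
season_of = {
    "dec": "winter", "jan": "winter", "feb": "winter",
    "mar": "spring", "apr": "spring", "may": "spring",
    "jun": "summer", "jul": "summer", "aug": "summer",
    "sep": "fall", "oct": "fall", "nov": "fall",
}
demo["season"] = demo["month"].map(season_of)

# Interaction term: product of temperature and relative humidity.
demo["temp_x_RH"] = demo["temp"] * demo["RH"]
print(demo)
```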
Handling Categorical Variables:
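- The 'month' and 'day' attributes are categorical, so they must be encoded numerically before being fed into the model. One-hot encoding is usually the safer choice for linear regression, because label encoding imposes an arbitrary numeric order on the categories.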
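A minimal one-hot encoding sketch with pandas, on made-up rows. Using drop_first=True is one possible choice here: it drops one level per variable so the dummy columns are not perfectly collinear with the intercept:

```python
import pandas as pd

# Made-up rows using the dataset's column names.
demo = pd.DataFrame({
    "month": ["aug", "sep", "aug"],
    "day": ["fri", "sat", "sun"],
    "wind": [4.0, 2.2, 5.8],
})

# One indicator column per category level, minus the first level of each variable.
encoded = pd.get_dummies(demo, columns=["month", "day"], drop_first=True)
print(encoded)
```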
Feature Scaling:
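- Standardize or normalize numeric features. Ordinary least squares as solved by LinearRegression gives the same predictions whether or not the inputs are rescaled, but scaling matters for regularized models such as Ridge or Lasso, whose penalties depend on coefficient size, and it improves the convergence of gradient-based optimization methods.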
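One way to apply scaling without leaking test-set statistics is to put the scaler inside a pipeline. The sketch below pairs StandardScaler with Ridge on synthetic columns whose ranges loosely mimic DC and wind (the data and the alpha value are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(7.9, 860.6, 100),   # wide range, similar to DC
    rng.uniform(0.4, 9.4, 100),     # narrow range, similar to wind
])
y_demo = rng.normal(size=100)

# The scaler is fitted only on the data passed to fit().
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X_demo, y_demo)

# After scaling, each column has mean 0 and unit standard deviation.
X_scaled = scaled_ridge.named_steps["standardscaler"].transform(X_demo)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```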
Address Multicollinearity:
- Check for multicollinearity among predictors using metrics like the Variance Inflation Factor (VIF). If found, consider dropping or combining correlated features.
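The VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all the others. The helper below (`vif` is a hypothetical name, written here for illustration rather than taken from a library) computes it with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Return VIF_j = 1 / (1 - R^2_j) for each column j of X."""
    X = np.asarray(X, dtype=float)
    values = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # R^2 of predicting column j from the remaining columns.
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        values.append(1.0 / (1.0 - r2))
    return np.array(values)

# Synthetic example: b is nearly a copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.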
Regularization:
- If you suspect that your model is overfitting, consider using a variation of linear regression that incorporates regularization, like Ridge or Lasso regression.
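A short sketch of both regularized variants on synthetic data where only the first of five features matters. The alpha values are arbitrary assumptions; in practice they would be tuned, for example by cross-validation:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only column 0 influences the target (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Ridge shrinks all coefficients; Lasso can set some exactly to zero.
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.2).fit(X_demo, y_demo)

print("ridge:", ridge.coef_.round(3))
print("lasso:", lasso.coef_.round(3))
```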
Data Augmentation:
- Sometimes, creating synthetic data points can help, especially if the dataset is unbalanced or if certain critical scenarios are underrepresented.
Transforming the Target Variable:
- Since the 'area' variable is highly skewed, in addition to log transformation, you might also consider other transformations like square root or Box-Cox to see if they yield better results.
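The alternative transforms can be tried side by side. In this sketch the area values are made up, and 1 is added before Box-Cox because that transform requires strictly positive inputs (the shift of 1 is an assumption, mirroring the log10(area + 1) transform used earlier):

```python
import numpy as np
from scipy import stats

# Made-up, heavily skewed areas in hectares (illustrative only).
area_ha = np.array([0.0, 0.0, 0.5, 2.0, 10.0, 50.0, 1000.0])

# Square root: defined at zero, milder than the logarithm.
area_sqrt = np.sqrt(area_ha)

# Box-Cox: lambda is estimated by maximum likelihood; inputs must be > 0.
area_boxcox, fitted_lambda = stats.boxcox(area_ha + 1.0)

print("skew raw:    ", stats.skew(area_ha))
print("skew sqrt:   ", stats.skew(area_sqrt))
print("skew box-cox:", stats.skew(area_boxcox), "lambda:", fitted_lambda)
```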
Feature Selection:
- Use methods like recursive feature elimination, forward selection, or Lasso to identify and retain only the most predictive features.
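Recursive feature elimination is available in scikit-learn as RFE. The sketch below uses synthetic data in which only columns 0 and 3 drive the target, and asks RFE to keep two features (that count is an assumption; it is normally chosen by validation):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only columns 0 and 3 matter (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Repeatedly fit the model and drop the feature with the smallest coefficient.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

print(selector.support_)   # boolean mask of kept features
print(selector.ranking_)   # 1 = kept; larger = eliminated earlier
```

Because RFE ranks by coefficient size, features should be on comparable scales before it is applied.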
Cross-Validation:
- Instead of a simple train-test split, use k-fold cross-validation. This helps in ensuring that the model performs well on different subsets of the data and gives a more generalized performance metric.
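A k-fold estimate of the MSE can be obtained with cross_val_score. This sketch uses synthetic data and five shuffled folds (both are assumptions for illustration); scikit-learn reports the negated MSE, so the sign is flipped afterwards:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 2))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=1.0, size=150)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)

# Scores are negative MSE values; negate to get the MSE per fold.
mse_per_fold = -scores
print(mse_per_fold, mse_per_fold.mean())
```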
Ensemble Methods:
- Sometimes, combining predictions from multiple models can result in better performance. Techniques like bagging or boosting can be explored.
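Bagging and boosting are both available in scikit-learn. The sketch below fits one of each on synthetic data and reports the test MSE; the data, the hyperparameters, and the resulting numbers are illustrative assumptions only:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a non-linear component (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * X_demo[:, 1] + rng.normal(scale=0.2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

ensembles = {
    # Bagging: average of decision trees fitted on bootstrap samples.
    "bagging": BaggingRegressor(n_estimators=50, random_state=42),
    # Boosting: trees added sequentially, each fitted to the current residuals.
    "boosting": GradientBoostingRegressor(random_state=42),
}

ensemble_mse = {
    name: mean_squared_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in ensembles.items()
}
print(ensemble_mse)
```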
Alternative Models:
- While linear regression is a good starting point, there might be non-linear relationships in your data that it cannot capture. You might want to explore other regression techniques like Decision Trees, Random Forests, Gradient Boosted Trees, or Neural Networks.
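The limitation is easy to demonstrate on a synthetic target with a purely quadratic dependence, which a straight line cannot follow but a random forest can. The data below are an assumption made up for illustration and do not reflect the forest fire dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic target that depends on the square of the first feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

lin = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

lin_mse = mean_squared_error(y_te, lin.predict(X_te))
rf_mse = mean_squared_error(y_te, rf.predict(X_te))
print(f"linear MSE: {lin_mse:.3f}, random forest MSE: {rf_mse:.3f}")
```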
Domain Knowledge:
- Incorporate any available domain knowledge about forest fires. For example, there might be known variables or factors, not in the current dataset, that affect the spread of forest fires. If possible, try to collect and incorporate such data.