Homework 7. Linear Model Selection and Regularization: House Prices
Introduction
This assignment is based on the House Prices: Advanced Regression Techniques Kaggle competition. The goal is to predict the sale price of a house based on its features.
Data Preprocessing
Load Data
import numpy as np
import pandas as pd

train = pd.read_csv('House Prices.csv')
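Before any cleaning, it helps to glance at the shape of the data and where values are missing; a quick check (output not shown):
print(train.shape)
print(train.isnull().sum().sort_values(ascending=False).head(10))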
Drop unnecessary columns
train.drop("Id", axis = 1, inplace = True)
One-hot encode categorical variables
# Treat literal 'NaN' strings as true missing values *before* encoding,
# otherwise get_dummies would turn them into their own category
train.replace('NaN', np.nan, inplace=True)
train = pd.get_dummies(train)
Impute missing values
# Every column is numeric after encoding, so the column median is well defined
train = train.fillna(train.median())
Make sure there are no missing values left:
assert train.isnull().sum().sum() == 0
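As an aside, the same median imputation can be expressed with scikit-learn's SimpleImputer, which is convenient inside pipelines; a minimal, equivalent sketch (assuming all columns are numeric after encoding):
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
train_imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)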
Split data
from sklearn.model_selection import train_test_split
X = train.drop(['SalePrice'], axis=1)
y = train.SalePrice
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
Simple Linear Regression
Fit and Predict
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
Evaluate
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
The resulting RMSE is about 23,658, a sizeable error relative to typical sale prices in this dataset. Let's see how regularization can help.
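A single train/test split can give a noisy estimate; as a sanity check (an addition, not part of the original workflow), cross-validation averages the error over several splits:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated RMSE for the baseline model
cv_rmse = -cross_val_score(LinearRegression(), X, y, cv=5,
                           scoring='neg_root_mean_squared_error')
print("CV RMSE:", cv_rmse.mean())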
Regularization
Lasso Regression
from sklearn import linear_model
# An alpha this small barely regularizes; raise max_iter so the coordinate
# descent solver converges on the unscaled data
lasso = linear_model.Lasso(alpha=1e-6, max_iter=10000)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
The resulting RMSE is about 23,839, essentially unchanged. That is expected: with an alpha this small the penalty is negligible, so the Lasso fit is nearly identical to ordinary least squares.
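One caveat worth noting: these features sit on very different scales, and the Lasso penalty treats all coefficients alike, so standardizing the features first is usually recommended. A minimal sketch using a Pipeline (an addition, not part of the original assignment; the alpha value is purely illustrative):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Scale features, then fit Lasso on the standardized data
scaled_lasso = make_pipeline(StandardScaler(), linear_model.Lasso(alpha=100))
scaled_lasso.fit(X_train, y_train)
print("Scaled-Lasso RMSE:",
      mean_squared_error(y_test, scaled_lasso.predict(X_test)) ** 0.5)
With that caveat noted, let's sweep a grid of alpha values by hand: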
alpha_vals = np.arange(0.01, 5, 0.01)
iter_train_perf = []
iter_test_perf = []
coef_frames = []
for alpha in alpha_vals:
    clf = linear_model.Lasso(alpha=alpha, max_iter=10000)
    clf.fit(X_train, y_train)
    y_hat_train = clf.predict(X_train)
    y_hat_test = clf.predict(X_test)
    rmse_train = mean_squared_error(y_train, y_hat_train) ** 0.5
    rmse_test = mean_squared_error(y_test, y_hat_test) ** 0.5
    iter_train_perf.append(rmse_train)
    iter_test_perf.append(rmse_test)
    # Record the coefficient path at this alpha
    dd = pd.DataFrame({'col': X_train.columns, 'coef': clf.coef_})
    dd['alpha'] = alpha
    coef_frames.append(dd)
# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
iter_coefs = pd.concat(coef_frames, ignore_index=True)
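Instead of sweeping alpha by hand, scikit-learn's LassoCV can choose it by cross-validation; a minimal sketch over the same grid:
from sklearn.linear_model import LassoCV
lasso_cv = LassoCV(alphas=alpha_vals, cv=5, max_iter=10000)
lasso_cv.fit(X_train, y_train)
print("Best alpha:", lasso_cv.alpha_)
print("Test RMSE:",
      mean_squared_error(y_test, lasso_cv.predict(X_test)) ** 0.5)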
Either way, the test RMSE stays stubbornly high across this range of alphas. Let's see if Ridge Regression can do better.
Ridge Regression
alpha_vals = np.arange(0.1, 200, 1)
iter_train_perf = []
iter_test_perf = []
coef_frames = []
for alpha in alpha_vals:
    clf = linear_model.Ridge(alpha=alpha)
    clf.fit(X_train, y_train)
    y_hat_train = clf.predict(X_train)
    y_hat_test = clf.predict(X_test)
    rmse_train = mean_squared_error(y_train, y_hat_train) ** 0.5
    rmse_test = mean_squared_error(y_test, y_hat_test) ** 0.5
    iter_train_perf.append(rmse_train)
    iter_test_perf.append(rmse_test)
    # Record the coefficient path at this alpha
    dd = pd.DataFrame({'col': X_train.columns, 'coef': clf.coef_})
    dd['alpha'] = alpha
    coef_frames.append(dd)
iter_coefs = pd.concat(coef_frames, ignore_index=True)
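To see how the error responds to alpha, a minimal matplotlib sketch of the train and test curves collected above:
import matplotlib.pyplot as plt
plt.plot(alpha_vals, iter_train_perf, label='train RMSE')
plt.plot(alpha_vals, iter_test_perf, label='test RMSE')
plt.xlabel('alpha')
plt.ylabel('RMSE')
plt.legend()
plt.show()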
Although the trend is clearer, the test RMSE remains large; in fact, it is slightly larger than with Lasso regression.
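For completeness, RidgeCV performs the same search with built-in cross-validation; a minimal sketch:
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=alpha_vals, cv=5)
ridge_cv.fit(X_train, y_train)
print("Best alpha:", ridge_cv.alpha_)
print("Test RMSE:",
      mean_squared_error(y_test, ridge_cv.predict(X_test)) ** 0.5)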
Discussion
We tried to predict house prices using a Kaggle dataset. First, we cleaned the data by dropping unnecessary columns, one-hot encoding the categorical variables, and imputing missing values with column medians. A baseline linear model produced a fairly high error (RMSE around 23,658).
To improve on it, we fit Lasso regression and tuned its penalty strength (alpha), but the error remained high across the whole grid. Ridge regression came next; while the effect of alpha was easier to see, the error stayed high, and was sometimes even worse than with Lasso.
In short, neither Lasso nor Ridge beat the simple model here. Future steps could include standardizing the features (regularization penalizes all coefficients on the same scale, so unscaled features can blunt its benefit), engineering new features, transforming the skewed target, trying other model families, or combining models.
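One concrete direction, offered as a sketch rather than a tested result: SalePrice is right-skewed, and fitting on log-prices often stabilizes the error (the Kaggle competition itself scores on the log scale):
# Fit on log(1 + price), then invert the transform before scoring
log_model = LinearRegression()
log_model.fit(X_train, np.log1p(y_train))
y_pred_log = np.expm1(log_model.predict(X_test))
print("RMSE (log-target model):",
      mean_squared_error(y_test, y_pred_log) ** 0.5)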