Homework 7. Linear Model Selection and Regularization: House Prices

Introduction

This assignment is based on the House Prices: Advanced Regression Techniques Kaggle competition. The goal is to predict the sale price of a house based on its features.


Data Preprocessing

Load Data

import pandas as pd
import numpy as np

train = pd.read_csv('House Prices.csv')

Drop unnecessary columns

train.drop("Id", axis = 1, inplace = True)

One-hot encode categorical variables

train = pd.get_dummies(train)
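
As a quick sanity check (illustrative, not part of the original steps), the encoding should leave only numeric columns, which is what the median imputation below relies on:

# after one-hot encoding every remaining column is numeric
print(train.shape)
print(train.dtypes.value_counts())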

Impute missing values

# ensure any literal 'NaN' strings become real missing values, then fill
# the remaining gaps with each column's median (all numeric after get_dummies)
train.replace('NaN', np.nan, inplace=True)
train = train.fillna(train.median())

Make sure there are no missing values left:

assert train.isnull().sum().sum() == 0
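
If the assertion ever fails, a per-column count of remaining missing values is a quick way to see what was not imputed (an illustrative diagnostic, not in the original pipeline):

missing = train.isnull().sum()
print(missing[missing > 0])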

Split data

from sklearn.model_selection import train_test_split

X = train.drop(['SalePrice'], axis=1)
y = train.SalePrice

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Simple Linear Regression

Fit and Predict

from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
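
As a quick look at what the unregularized model learned (illustrative), we can list the largest coefficient magnitudes; very large coefficients on a few dummy columns are one hint that some shrinkage could help:

# ten largest coefficient magnitudes of the plain least-squares fit
coef = pd.Series(regr.coef_, index=X_train.columns)
print(coef.abs().sort_values(ascending=False).head(10))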

Evaluate

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("RMSE:", rmse)

The resulting RMSE is about 23658, which is quite a large error. Let's see whether regularization can help.
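
For context (not part of the original assignment), a naive baseline that always predicts the mean training price gives a sense of scale; a minimal sketch using sklearn's DummyRegressor:

from sklearn.dummy import DummyRegressor

# baseline: always predict the mean SalePrice seen in training
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
print("Baseline RMSE:", baseline_rmse)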


Regularization

Lasso Regression

from sklearn import linear_model

# an alpha this small makes the L1 penalty negligible, so the fit is close to plain OLS
lasso = linear_model.Lasso(alpha=0.000001)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("RMSE:", rmse)

The resulting RMSE is about 23839, still a large error. Let's try a range of alpha values.

alpha_vals = np.arange(0.01, 5, .01)

iter_coef_frames = []
iter_train_perf = []
iter_test_perf = []

for alpha in alpha_vals:
    clf = linear_model.Lasso(alpha=alpha)
    clf.fit(X_train, y_train)
    y_hat_train = clf.predict(X_train)
    y_hat_test = clf.predict(X_test)
    rmse_train = mean_squared_error(y_train, y_hat_train) ** 0.5
    rmse_test = mean_squared_error(y_test, y_hat_test) ** 0.5
    iter_train_perf.append(rmse_train)
    iter_test_perf.append(rmse_test)
    dd = pd.DataFrame({'col': X_train.columns, 'coef': clf.coef_})
    dd['alpha'] = alpha
    iter_coef_frames.append(dd)

# DataFrame.append has been removed from recent pandas, so build the frame with concat
iter_coefs = pd.concat(iter_coef_frames, ignore_index=True)
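
With the test-set RMSE recorded for every alpha, the best value can be read off directly (an illustrative check using the arrays built above):

best_idx = int(np.argmin(iter_test_perf))
print("best alpha:", alpha_vals[best_idx])
print("best test RMSE:", iter_test_perf[best_idx])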

Unfortunately, the RMSE is still quite large even with different values of alpha. Let's see if Ridge Regression can do better.

Figure: train and test RMSE as a function of alpha for Lasso regression (rmse_vs_alpha_lasso).
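
The figure above can be reproduced with a plot along these lines (a minimal sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

plt.plot(alpha_vals, iter_train_perf, label='train RMSE')
plt.plot(alpha_vals, iter_test_perf, label='test RMSE')
plt.xlabel('alpha')
plt.ylabel('RMSE')
plt.title('Lasso: RMSE vs. alpha')
plt.legend()
plt.show()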

Ridge Regression

from sklearn import linear_model

alpha_vals = np.arange(0.1, 200, 1)

iter_coef_frames = []
iter_train_perf = []
iter_test_perf = []

for alpha in alpha_vals:
    clf = linear_model.Ridge(alpha=alpha)
    clf.fit(X_train, y_train)
    y_hat_train = clf.predict(X_train)
    y_hat_test = clf.predict(X_test)
    rmse_train = mean_squared_error(y_train, y_hat_train) ** 0.5
    rmse_test = mean_squared_error(y_test, y_hat_test) ** 0.5
    iter_train_perf.append(rmse_train)
    iter_test_perf.append(rmse_test)
    dd = pd.DataFrame({'col': X_train.columns, 'coef': clf.coef_})
    dd['alpha'] = alpha
    iter_coef_frames.append(dd)

iter_coefs = pd.concat(iter_coef_frames, ignore_index=True)

Figure: train and test RMSE as a function of alpha for Ridge regression (rmse_vs_alpha_ridge).

Although the trend is clearer, the RMSE is still quite large (in fact, it is even larger than with Lasso regression).


Discussion

We tried to predict house prices using a Kaggle dataset. First, we cleaned up the data by dropping unnecessary columns, one-hot encoding the categorical variables, and filling in missing values with column medians. We started with a basic linear regression model, but the error was quite high (RMSE around 23658).

To improve on this, we used Lasso regression and varied its penalty strength (alpha), but the error remained high across the whole range. Ridge regression was next, and while the trend with alpha was clearer, the error was still too high, sometimes even worse than with Lasso.

In short, neither Lasso nor Ridge gave us better results than the simple model. Future steps could include tweaking the features, trying different models, or combining models together.
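
As one concrete example of such a tweak (a sketch, not something tried above): Lasso and Ridge are sensitive to feature scale, so standardizing the inputs and letting cross-validation choose alpha is a natural next step, e.g. with a scikit-learn pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# standardize features, then let 5-fold CV pick alpha from a log-spaced grid
model = make_pipeline(StandardScaler(),
                      LassoCV(alphas=np.logspace(-3, 3, 50), cv=5, max_iter=10000))
model.fit(X_train, y_train)
rmse_cv = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print("Standardized LassoCV RMSE:", rmse_cv)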