Homework 3: Classification: Credit Worthiness
Introduction
In this assignment, we're using the German Credit Dataset to predict a person's credit risk based on a set of attributes. The dataset consists of 1000 instances with 20 attributes. The attributes include:
Class: Represents creditworthiness (the target variable we're trying to predict)
Duration: Duration of the credit (in months)
Amount: Loan amount (in Deutsche Marks, DM)
InstallmentRatePercentage: Installment rate as a percentage of disposable income
ResidenceDuration: Duration of residency in years
Age: Age of the applicant
NumberExistingCredits: Number of existing credit accounts
NumberPeopleMaintenance: Number of people for whom the applicant is responsible
Telephone: Indicates whether a phone number is linked to the customer
ForeignWorker: Indicates whether the applicant is a foreign worker
CheckingAccountStatus: Represents the balance in the checking account (in DM), categorized as:
  CheckingAccountStatus.lt.0
  CheckingAccountStatus.0.to.200
  CheckingAccountStatus.gt.200
CreditHistory: Provides information on the applicant's past credit behavior. Types include:
  CreditHistory.ThisBank.AllPaid
  CreditHistory.PaidDuly
  CreditHistory.Delay
  CreditHistory.Critical
Data Preprocessing
Importing the Data
import pandas as pd

credit_train = pd.read_csv('credit_train.csv')
credit_test = pd.read_csv('credit_test.csv')
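As a quick sanity check (optional, assuming the two CSVs together hold the 1,000 instances described above), we can confirm the shapes:
# Row and column counts for each split
print(credit_train.shape)
print(credit_test.shape)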
Exploratory Data Analysis
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Plot the distribution of requested loan amounts
# (sns.distplot is deprecated; histplot with a KDE overlay is the modern equivalent)
sns.histplot(credit_train['Amount'], kde=True)
plt.show()
Most people request amounts less than 4,000 DM.
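To put a number on that observation, a one-line check on the same training frame gives the share of requests under 4,000 DM:
# Fraction of applicants requesting less than 4,000 DM
print((credit_train['Amount'] < 4000).mean())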
pd.crosstab(credit_train['CreditHistory.Critical'], credit_train['Class']).plot(
    kind='bar', title='Relationship between Credit History and Credit Worthiness');
The proportion of good credit risks differs noticeably between applicants with and without a critical credit history, so credit history is clearly related to creditworthiness.
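The bar chart shows raw counts; row-normalizing the crosstab (a small optional addition) makes the class proportions within each group explicit:
# Proportion of each Class within the CreditHistory.Critical groups
print(pd.crosstab(credit_train['CreditHistory.Critical'],
                  credit_train['Class'], normalize='index'))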
Binary Encoding
To feed the data into our model, we need to convert categorical data into a numeric format. Here, we're converting the Class column:
# Map the target labels to integers: Bad -> 0, Good -> 1
cleanup = {"Class": {"Bad": 0, "Good": 1}}
credit_train.replace(cleanup, inplace=True)
print(credit_train["Class"].head())

# Apply the same mapping to the test set
credit_test.replace(cleanup, inplace=True)
Logistic Regression
We'll run logistic regression using the statsmodels library. Our predictor is CreditHistory.Critical and the response variable is Class.
import statsmodels.discrete.discrete_model as sm

# Note: passing the raw predictor column means the model is fit WITHOUT an intercept
logit = sm.Logit(credit_train['Class'].values,
                 credit_train['CreditHistory.Critical'].values)

# Fit the model
result = logit.fit()
print(result.summary2())

# Examine the confidence interval for each coefficient
print(result.conf_int())
While our coefficient is statistically significant, the model's fit is poor: the pseudo R-squared is negative (possible here because the model was fit without an intercept, so it can do worse than the intercept-only null model), and the AIC and BIC are high.
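As a sanity check (not part of the assignment), refitting with an explicit intercept via statsmodels' add_constant typically brings the pseudo R-squared back into its usual [0, 1] range:
from statsmodels.tools import add_constant

# Prepend an intercept column and refit
X_const = add_constant(credit_train['CreditHistory.Critical'].values)
result_const = sm.Logit(credit_train['Class'].values, X_const).fit()
print(result_const.summary2())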
Prediction and Evaluation
We predict class labels for the test set and use our print_metrics() function to display the model's performance:
import numpy as np

# Build the test-set predictor and target to mirror the training setup
X_test = credit_test['CreditHistory.Critical'].values
y_test = credit_test['Class'].values

# Convert predicted probabilities to class labels with a 0.5 threshold
y_probs = result.predict(X_test)
y_pred = np.where(y_probs > 0.5, 1, 0)
import sklearn.metrics as sklm

# Confusion matrix: rows are actual classes, columns are predictions,
# in label order [0 = Bad, 1 = Good]
conf = sklm.confusion_matrix(y_test, y_pred)
print('Confusion matrix:')
print('            Predicted Bad  Predicted Good')
print('Actual Bad  %13d %15d' % (conf[0,0], conf[0,1]))
print('Actual Good %13d %15d' % (conf[1,0], conf[1,1]))
print('')
# Accuracy
accuracy = sklm.accuracy_score(y_test, y_pred)
print('Accuracy: %0.2f' % accuracy)
print('')
# Precision, recall, F1-score, and support for each class
precision, recall, f1_score, support = sklm.precision_recall_fscore_support(y_test, y_pred)
print('Metrics for class 0 (Bad):')
print('Num cases: %6d' % support[0])
print('Precision: %6.2f' % precision[0])
print('Recall:    %6.2f' % recall[0])
print('F1 Score:  %6.2f' % f1_score[0])
print('')
print('Metrics for class 1 (Good):')
print('Num cases: %6d' % support[1])
print('Precision: %6.2f' % precision[1])
print('Recall:    %6.2f' % recall[1])
print('F1 Score:  %6.2f' % f1_score[1])
print('')
Our model's accuracy is 50%, which is no better than random guessing.
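For context, roughly 70% of applicants in the full German Credit data are good risks, so 50% likely trails even the trivial majority-class baseline; a quick check on our test split:
# Accuracy of always predicting the test set's majority class
majority_label = int(y_test.mean() >= 0.5)
print('Majority-class baseline: %0.2f' % (y_test == majority_label).mean())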
Use LDA to Predict Credit Worthiness
Although we know that CreditHistory.Critical is a significant predictor of creditworthiness, we can't use it alone to predict creditworthiness accurately. We'll use LDA to create a linear combination of predictors that best separates the classes.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Use every column except Class as a predictor
x_train, y_train = credit_train.drop('Class', axis=1), credit_train['Class']
x_test = credit_test.drop('Class', axis=1)

lda = LinearDiscriminantAnalysis()
y_pred = lda.fit(x_train, y_train).predict(x_test)
The new accuracy is about 0.72, a substantial improvement over the single-predictor logistic regression.
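To reproduce that figure (reusing y_test and sklm from the evaluation above):
# Overall accuracy of the LDA predictions on the test set
print('LDA accuracy: %0.2f' % sklm.accuracy_score(y_test, y_pred))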