Homework 3: Classification: Credit Worthiness
Introduction
In this assignment, we're using the German Credit Dataset to predict a person's credit risk based on a set of attributes. The dataset consists of 1000 instances with 20 attributes. The attributes include:
Class: Represents creditworthiness (the target variable we're trying to predict)
Duration: Duration of the credit (in months)
Amount: Loan amount (in Deutsche Marks, DM)
InstallmentRatePercentage: Installment rate as a percentage of disposable income
ResidenceDuration: Duration of residency in years
Age: Age of the applicant
NumberExistingCredits: Number of existing credit accounts
NumberPeopleMaintenance: Number of people for whom the applicant is responsible
Telephone: Indicates whether a phone number is linked to the customer
ForeignWorker: Indicates whether the applicant is a foreign worker
CheckingAccountStatus: Represents the balance in the checking account (in DM), categorized as:
  CheckingAccountStatus.lt.0
  CheckingAccountStatus.0.to.200
  CheckingAccountStatus.gt.200
CreditHistory: Provides information on the applicant's past credit behavior. Types include:
  CreditHistory.ThisBank.AllPaid
  CreditHistory.PaidDuly
  CreditHistory.Delay
  CreditHistory.Critical
Data Preprocessing
Importing the Data
import pandas as pd

credit_train = pd.read_csv('credit_train.csv')
credit_test = pd.read_csv('credit_test.csv')
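As a quick sanity check (optional, assuming the two CSVs together hold the 1,000 instances described above), we can confirm the shapes:
# Row and column counts for each split
print(credit_train.shape)
print(credit_test.shape)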
Exploratory Data Analysis
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Plot the distribution of requested loan amounts
# (sns.distplot is deprecated; histplot with a KDE overlay is the modern equivalent)
sns.histplot(credit_train['Amount'], kde=True)
plt.show()
Most people request amounts less than 4,000 DM.
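To put a number on that observation, a one-line check on the same training frame gives the share of requests under 4,000 DM:
# Fraction of applicants requesting less than 4,000 DM
print((credit_train['Amount'] < 4000).mean())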
pd.crosstab(credit_train['CreditHistory.Critical'], credit_train['Class']).plot(
    kind='bar', title='Relationship between Credit History and Credit Worthiness');
The proportion of good credit risks differs noticeably between applicants with and without a critical credit history, so credit history is clearly related to creditworthiness.
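The bar chart shows raw counts; row-normalizing the crosstab (a small optional addition) makes the class proportions within each group explicit:
# Proportion of each Class within the CreditHistory.Critical groups
print(pd.crosstab(credit_train['CreditHistory.Critical'],
                  credit_train['Class'], normalize='index'))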
Binary Encoding
To feed the data into our model, we need to convert categorical data into a numeric format. Here, we're converting the Class column:
# Map the target labels to integers: Bad -> 0, Good -> 1
cleanup = {"Class": {"Bad": 0, "Good": 1}}
credit_train.replace(cleanup, inplace=True)
print(credit_train["Class"].head())

# Apply the same mapping to the test set
credit_test.replace(cleanup, inplace=True)
Logistic Regression
We'll run logistic regression using the statsmodels library. Our predictor is CreditHistory.Critical and the response variable is Class.
import statsmodels.discrete.discrete_model as sm

# Note: passing the raw predictor column means the model is fit WITHOUT an intercept
logit = sm.Logit(credit_train['Class'].values,
                 credit_train['CreditHistory.Critical'].values)

# Fit the model
result = logit.fit()
print(result.summary2())

# Examine the confidence interval for each coefficient
print(result.conf_int())
While our coefficient is statistically significant, the model's fit is poor: the pseudo R-squared is negative (possible here because the model was fit without an intercept, so it can do worse than the intercept-only null model), and the AIC and BIC are high.
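As a sanity check (not part of the assignment), refitting with an explicit intercept via statsmodels' add_constant typically brings the pseudo R-squared back into its usual [0, 1] range:
from statsmodels.tools import add_constant

# Prepend an intercept column and refit
X_const = add_constant(credit_train['CreditHistory.Critical'].values)
result_const = sm.Logit(credit_train['Class'].values, X_const).fit()
print(result_const.summary2())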
Prediction and Evaluation
We predict class labels for the test set and use our print_metrics() function to display the model's performance:
import numpy as np

# Build the test-set predictor and target to mirror the training setup
X_test = credit_test['CreditHistory.Critical'].values
y_test = credit_test['Class'].values

# Convert predicted probabilities to class labels with a 0.5 threshold
y_probs = result.predict(X_test)
y_pred = np.where(y_probs > 0.5, 1, 0)
import sklearn.metrics as sklm

# Confusion matrix: rows are actual classes, columns are predictions,
# in label order [0 = Bad, 1 = Good]
conf = sklm.confusion_matrix(y_test, y_pred)
print('Confusion matrix:')
print('            Predicted Bad  Predicted Good')
print('Actual Bad  %13d %15d' % (conf[0,0], conf[0,1]))
print('Actual Good %13d %15d' % (conf[1,0], conf[1,1]))
print('')
# Accuracy
accuracy = sklm.accuracy_score(y_test, y_pred)
print('Accuracy: %0.2f' % accuracy)
print('')
# Precision, recall, F1-score, and support for each class
precision, recall, f1_score, support = sklm.precision_recall_fscore_support(y_test, y_pred)
print('Metrics for class 0 (Bad):')
print('Num cases: %6d' % support[0])
print('Precision: %6.2f' % precision[0])
print('Recall:    %6.2f' % recall[0])
print('F1 Score:  %6.2f' % f1_score[0])
print('')
print('Metrics for class 1 (Good):')
print('Num cases: %6d' % support[1])
print('Precision: %6.2f' % precision[1])
print('Recall:    %6.2f' % recall[1])
print('F1 Score:  %6.2f' % f1_score[1])
print('')
Our model's accuracy is 50%, which is no better than random guessing.
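For context, roughly 70% of applicants in the full German Credit data are good risks, so 50% likely trails even the trivial majority-class baseline; a quick check on our test split:
# Accuracy of always predicting the test set's majority class
majority_label = int(y_test.mean() >= 0.5)
print('Majority-class baseline: %0.2f' % (y_test == majority_label).mean())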
Use LDA to Predict Credit Worthiness
Although we know that CreditHistory.Critical is a significant predictor of creditworthiness, we can't use it alone to predict creditworthiness accurately. We'll use LDA to create a linear combination of predictors that best separates the classes.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Use every column except Class as a predictor
x_train, y_train = credit_train.drop('Class', axis=1), credit_train['Class']
x_test = credit_test.drop('Class', axis=1)

lda = LinearDiscriminantAnalysis()
y_pred = lda.fit(x_train, y_train).predict(x_test)
The new accuracy is about 0.72, a substantial improvement over the single-predictor logistic regression.
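To reproduce that figure (reusing y_test and sklm from the evaluation above):
# Overall accuracy of the LDA predictions on the test set
print('LDA accuracy: %0.2f' % sklm.accuracy_score(y_test, y_pred))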