Homework 3: Classification: Credit Worthiness
Introduction
In this assignment, we're using the German Credit Dataset to predict a person's credit risk based on a set of attributes. The dataset consists of 1000 instances with 20 attributes. The attributes include:
- Class: Represents creditworthiness (the target variable we're trying to predict)
- Duration: Duration of the credit (in months)
- Amount: Loan amount (in Deutsche Marks (DM))
- InstallmentRatePercentage: Installment rate as a percentage of disposable income
- ResidenceDuration: Duration of residency in years
- Age: Age of the applicant
- NumberExistingCredits: Number of existing credit accounts
- NumberPeopleMaintenance: Number of people for whom the applicant is responsible
- Telephone: Indicates if a phone number is linked to the customer
- ForeignWorker: Indicates if the applicant is a foreign worker
- CheckingAccountStatus: Represents the balance in the checking account (in DM), categorized as:
  - CheckingAccountStatus.lt.0
  - CheckingAccountStatus.0.to.200
  - CheckingAccountStatus.gt.200
- CreditHistory: Provides information on the applicant's past credit behavior. Types include:
  - CreditHistory.ThisBank.AllPaid
  - CreditHistory.PaidDuly
  - CreditHistory.Delay
  - CreditHistory.Critical
Data Preprocessing
Importing the Data
import pandas as pd

credit_train = pd.read_csv('credit_train.csv')
credit_test = pd.read_csv('credit_test.csv')
Exploratory Data Analysis
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.histplot(credit_train['Amount'], kde=True)  # distplot is deprecated in newer seaborn
plt.show()
Most people request amounts less than 4,000 DM.
pd.crosstab(credit_train['CreditHistory.Critical'], credit_train['Class']).plot(kind='bar',title="Relationship between Credit History and Credit Worthiness");
People with good credit histories are more likely to get approved for credit.
Binary Encoding
To feed the data into our model, we need to convert categorical data into a numeric format. Here, we're converting the Class column:
cleanup = {"Class": {"Bad": 0, "Good": 1}}
credit_train.replace(cleanup, inplace=True)
print(credit_train["Class"].head())
credit_test.replace(cleanup, inplace=True)
Logistic Regression
We'll run logistic regression using the statsmodels library. Our predictor is CreditHistory.Critical and the response variable is Class.
import statsmodels.api as sm
logit = sm.Logit(credit_train['Class'].values, credit_train['CreditHistory.Critical'].values)
# Fit the model
result = logit.fit()
print(result.summary2())
# Examine the confidence interval for each coefficient
print(result.conf_int())
While our coefficient is statistically significant, the model's fit is poor: the pseudo R-squared is negative (a side effect of fitting without an intercept, which allows the model to fit worse than the null model) and the AIC and BIC are high.
Prediction and Evaluation
We predict class labels for the test set and print metrics summarizing the model's performance:
import numpy as np

X_test = credit_test['CreditHistory.Critical'].values
y_test = credit_test['Class'].values
y_probs = result.predict(X_test)
y_pred = np.where(y_probs > 0.5, 1, 0)
import sklearn.metrics as sklm
# Confusion matrix
conf = sklm.confusion_matrix(y_test, y_pred)
print('Confusion matrix:')
print('                 Predicted negative  Predicted positive')
print('Actual negative    %6d' % conf[0,0] + '              %5d' % conf[0,1])
print('Actual positive    %6d' % conf[1,0] + '              %5d' % conf[1,1])
print('')
# Accuracy
accuracy = sklm.accuracy_score(y_test, y_pred)
print('Accuracy: %0.2f' % accuracy)
print('')
# Precision, Recall, F1-Score
precision, recall, f1_score, support = sklm.precision_recall_fscore_support(y_test, y_pred)
# sklearn orders classes by ascending label: index 0 is class 0 (Bad), index 1 is class 1 (Good)
print('Negative (Bad) metrics:')
print('Num cases:  %6d' % support[0])
print('Precision:  %6.2f' % precision[0])
print('Recall:     %6.2f' % recall[0])
print('F1 Score:   %6.2f' % f1_score[0])
print('')
print('Positive (Good) metrics:')
print('Num cases:  %6d' % support[1])
print('Precision:  %6.2f' % precision[1])
print('Recall:     %6.2f' % recall[1])
print('F1 Score:   %6.2f' % f1_score[1])
print('')
Our model's accuracy is 50%, which is no better than random guessing.
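A useful sanity check here is the majority-class baseline: a trivial classifier that always predicts the most frequent label. Since the German Credit data is imbalanced toward Good, 50% accuracy is actually worse than this baseline. A sketch of computing it, with toy labels standing in for y_test:

```python
import numpy as np

# Toy stand-in for y_test; in the notebook use credit_test['Class'].values
y_test = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

# Majority-class baseline: always predict the most frequent label
majority = np.bincount(y_test).argmax()
baseline_acc = (y_test == majority).mean()
print(baseline_acc)  # → 0.7
```

Any model worth keeping should beat this number, not just 50%.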
Use LDA to Predict Credit Worthiness
Although we know that CreditHistory.Critical is a significant predictor of creditworthiness, it isn't enough on its own to predict creditworthiness accurately. We'll use LDA to create a linear combination of predictors that best separates the classes.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
# x_train/x_test hold the predictor columns; y_train/y_test hold the Class labels
y_pred = lda.fit(x_train, y_train).predict(x_test)
The new accuracy is about 0.72, which is a significant improvement over our previous model.
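The accuracy above can be computed with sklearn's accuracy_score. A self-contained sketch on synthetic two-class data (stand-ins for the credit features, since the exact x_train/x_test columns aren't shown here):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the credit predictors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(1.5, 1, (200, 4))])
y = np.array([0] * 200 + [1] * 200)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit LDA on the training split, predict the held-out split, and score it
lda = LinearDiscriminantAnalysis()
y_pred = lda.fit(x_train, y_train).predict(x_test)
print('Accuracy: %0.2f' % accuracy_score(y_test, y_pred))
```

In the notebook the same accuracy_score(y_test, y_pred) call, applied to the real splits, yields the ~0.72 figure quoted above.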