Forecasting credit defaults using machine learning algorithms in retail banking

This post walks through a complete machine learning pipeline for predicting credit risk. The goal is to compare several classifiers and identify which one best predicts whether a borrower will default. I use a dataset from Kaggle, preprocess it, and then train a range of models to compare their performance. The models considered are Logistic Regression, Decision Tree, Gradient Boosting, AdaBoost, Random Forest, K-Nearest Neighbors, and a Neural Network.

1. Data Preparation

I load the data with pandas and clean it by dropping duplicate rows and rows with null values. Categorical variables are one-hot encoded into dummy variables, and the resulting boolean columns are cast to uint8 so that every feature is numeric.

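import pandas as pd

# Load the raw data, then drop duplicate rows and rows with missing values
df = pd.read_excel('CreditDataset.xlsx', sheet_name='credit_risk_dataset')
df = df.drop_duplicates()
df = df.dropna()
# One-hot encode categoricals; drop_first omits one baseline level per variable
df = pd.get_dummies(df, drop_first=True)
# Cast boolean dummy columns to 0/1 integers
bool_columns = df.select_dtypes(include=['bool']).columns
df[bool_columns] = df[bool_columns].astype('uint8')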
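
To make the encoding step concrete, here is a minimal sketch on a three-row toy frame. The values are made up for illustration; only the column names mirror the dataset. It shows that drop_first=True removes one level (here MORTGAGE, the alphabetically first) so the remaining dummies are read relative to that baseline.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the Kaggle dataset.
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE"],
    "loan_status": [1, 0, 0],
})

# One dummy column per category level, minus the first level (the baseline).
encoded = pd.get_dummies(toy, drop_first=True)

# Recent pandas versions return bool dummy columns; cast them to 0/1 integers.
bool_columns = encoded.select_dtypes(include=["bool"]).columns
encoded[bool_columns] = encoded[bool_columns].astype("uint8")

print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in both person_home_ownership_OWN and person_home_ownership_RENT is therefore a MORTGAGE row.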

2. Feature Selection Using OLS Regression

The next step is an initial feature selection using Ordinary Least Squares (OLS) regression. The p-values from the OLS fit indicate which variables are statistically significant, and only variables with p-values below 0.05 are carried into the models that follow.

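import statsmodels.api as sm

dependent_var = 'loan_status'  # 1 = default, 0 = no default
X = df.drop(dependent_var, axis=1)
y = df[dependent_var]
X = sm.add_constant(X)
model_ols = sm.OLS(y, X).fit()
# Keep terms with p < 0.05 (reconstructed step; the original snippet omitted it)
results = pd.DataFrame({'Coefficient': model_ols.params, 'p-value': model_ols.pvalues})
significant_results = results[results['p-value'] < 0.05]
significant_vars = significant_results.index.drop('const')
X_significant = df[significant_vars]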

3. Model Training and Evaluation

I split the dataset into training and testing sets. Several machine learning models are initialized and trained on the training set. These models include:
– Logistic Regression
– Decision Tree
– Gradient Boosting
– AdaBoost
– Random Forest
– K-Nearest Neighbors
– Neural Network (MLPClassifier)
Performance metrics such as accuracy, precision, recall, F1 score, and ROC-AUC are collected for each model. Additionally, ROC and Precision-Recall curves are plotted to visualize the performance.

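from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'AdaBoost': AdaBoostClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
}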
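
The split and the training loop are described but not shown, so here is a minimal sketch of how they might look. It runs on synthetic data from make_classification as a stand-in for X_significant and y, uses only two of the seven models to stay short, and stratifies the split; the stand-in data, the demo_models name, and the stratify choice are my assumptions, not the original code. It fills the metrics and roc_curves dictionaries that the plotting code relies on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for X_significant and y (assumption).
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.8, 0.2], random_state=42)

# 80/20 split; stratify keeps the default rate similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# Two models only, to keep the sketch short (hypothetical subset).
demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

metrics, roc_curves = {}, {}
for name, model in demo_models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard 0/1 labels
    y_prob = model.predict_proba(X_test)[:, 1]   # probability of default
    metrics[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, zero_division=0),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
        "ROC-AUC": roc_auc_score(y_test, y_prob),
    }
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_curves[name] = (fpr, tpr)

for name, scores in metrics.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Note that ROC-AUC is computed from predicted probabilities, while the other four metrics depend on the default 0.5 decision threshold.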

4. Results and Visualization

With the models trained, I gather the metrics into a single table and plot three views of performance: ROC curves, Precision-Recall curves, and a confusion matrix for each model. All models are evaluated on the test set using accuracy, precision, recall, F1 score, and ROC-AUC.

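import matplotlib.pyplot as plt

# metrics and roc_curves are filled in during the training loop
metrics_df = pd.DataFrame(metrics).T
plt.figure(figsize=(10, 8))
for name, (fpr, tpr) in roc_curves.items():
    plt.plot(fpr, tpr, lw=2, label=f'{name} (area = {metrics[name]["ROC-AUC"]:.2f})')
plt.plot([0, 1], [0, 1], 'k--', lw=1)  # chance line (random guessing)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()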

from sklearn.metrics import auc, precision_recall_curve

plt.figure(figsize=(10, 8))
for name, model in models.items():
    # Probability of the positive class (default)
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, y_pred_prob)
    pr_auc = auc(recall, precision)
    plt.plot(recall, precision, lw=2, label=f'{name} (area = {pr_auc:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(loc="lower left")
plt.show()

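import seaborn as sns
from sklearn.metrics import confusion_matrix

# 4 x 2 grid: a 3 x 2 grid has only six axes, so zip() would silently skip the seventh model
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(15, 20))
for ax, (name, model) in zip(axes.flatten(), models.items()):
    y_pred = model.predict(X_test)
    conf_matrix = confusion_matrix(y_test, y_pred)
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', ax=ax)
    ax.set_title(f'Confusion Matrix - {name}')
for ax in axes.flatten()[len(models):]:
    ax.axis('off')  # hide the unused eighth panel
plt.tight_layout()
plt.show()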

Results

Ordinary Least Squares

OLS Regression Results                            
==============================================================================
Dep. Variable:            loan_status   R-squared:                       0.357
Model:                            OLS   Adj. R-squared:                  0.356
Method:                 Least Squares   F-statistic:                     717.9
Date:                Sat, 08 Jun 2024   Prob (F-statistic):               0.00
Time:                        23:46:13   Log-Likelihood:                -8900.3
No. Observations:               28501   AIC:                         1.785e+04
Df Residuals:                   28478   BIC:                         1.804e+04
Df Model:                          22                                         
Covariance Type:            nonrobust

OLS key insights

Dependent variable: loan_status (1 indicates default, 0 indicates no default)

R-squared: 0.357

Approximately 35.7% of the variability in loan status is explained by the independent variables in the model.

Adjusted R-squared: 0.356

After adjusting for the number of predictors, the model still shows moderate explanatory power.

F-statistic: 717.9 (p-value: 0.00)

The overall model is statistically significant.

OLS Interpretation

R-squared and Adjusted R-squared:

These values indicate that the model explains about 35.7% of the variance in the dependent variable (loan_status). While this is a moderate level of explanatory power, it suggests that other factors not included in the model also play a significant role in determining loan default status.

F-statistic:

The high F-statistic value and its associated p-value (0.00) indicate that the overall model is statistically significant, meaning that at least some of the predictors are significantly associated with loan default status.

Log-Likelihood: -8900.3

This measures how well the model fits the data, with higher values indicating a better fit.

Akaike Information Criterion (AIC): 1.785e+04

A lower AIC value indicates a better model fit when comparing multiple models.

Bayesian Information Criterion (BIC): 1.804e+04

Like AIC, BIC penalizes the number of parameters, but its penalty grows with the sample size, so it favors simpler models more strongly.
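
As a sanity check, the summary statistics above can be recomputed from one another using the standard textbook formulas. This is a small sketch using only numbers copied from the OLS output; the variable names are mine. Because R-squared is rounded to three decimals in the summary, the recomputed adjusted R-squared and F-statistic only match approximately.

```python
import math

# Values copied from the OLS summary above.
n_obs = 28501        # No. Observations
df_model = 22        # Df Model (predictors, excluding the constant)
log_lik = -8900.3    # Log-Likelihood
r_squared = 0.357    # R-squared (rounded in the summary)

k = df_model + 1         # estimated parameters, including the constant
df_resid = n_obs - k     # 28478, matches Df Residuals

aic = 2 * k - 2 * log_lik                   # penalty of 2 per parameter
bic = k * math.log(n_obs) - 2 * log_lik     # penalty of ln(n) per parameter
adj_r2 = 1 - (1 - r_squared) * (n_obs - 1) / df_resid
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)

print(f"AIC    = {aic:.1f}")
print(f"BIC    = {bic:.1f}")
print(f"Adj R2 = {adj_r2:.4f}")
print(f"F      = {f_stat:.1f}")
```

AIC and BIC come out at roughly 17,847 and 18,037, which round to the reported 1.785e+04 and 1.804e+04. The only difference between them is the per-parameter penalty: 2 for AIC versus ln(28501), about 10.3, for BIC.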

Coefficient, standard error, p and t values

===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const                          -0.0729      0.020     -3.590      0.000      -0.113      -0.033
person_age                     -0.0015      0.001     -2.468      0.014      -0.003      -0.000
person_income                3.261e-07   3.86e-08      8.452      0.000    2.51e-07    4.02e-07
person_emp_length              -0.0004      0.000     -0.829      0.407      -0.001       0.001
loan_amnt                   -1.379e-05   4.62e-07    -29.868      0.000   -1.47e-05   -1.29e-05
loan_int_rate                   0.0078      0.002      4.009      0.000       0.004       0.012
loan_percent_income             1.7801      0.027     65.836      0.000       1.727       1.833
cb_person_cred_hist_length      0.0016      0.001      1.640      0.101      -0.000       0.003
person_home_ownership_OTHER     0.0446      0.034      1.294      0.196      -0.023       0.112
person_home_ownership_OWN      -0.1279      0.008    -16.195      0.000      -0.143      -0.112
person_home_ownership_RENT      0.0887      0.004     19.984      0.000       0.080       0.097
loan_intent_EDUCATION          -0.0977      0.007    -14.781      0.000      -0.111      -0.085
loan_intent_HOMEIMPROVEMENT    -0.0013      0.008     -0.171      0.864      -0.016       0.014
loan_intent_MEDICAL            -0.0169      0.007     -2.514      0.012      -0.030      -0.004
loan_intent_PERSONAL           -0.0706      0.007    -10.312      0.000      -0.084      -0.057
loan_intent_VENTURE            -0.1123      0.007    -16.399      0.000      -0.126      -0.099
loan_grade_B                    0.0041      0.009      0.477      0.633      -0.013       0.021
loan_grade_C                    0.0237      0.014      1.749      0.080      -0.003       0.050
loan_grade_D                    0.3809      0.017     21.983      0.000       0.347       0.415
loan_grade_E                    0.4199      0.022     18.737      0.000       0.376       0.464
loan_grade_F                    0.4690      0.032     14.635      0.000       0.406       0.532
loan_grade_G                    0.7291      0.050     14.528      0.000       0.631       0.827
cb_person_default_on_file_Y     0.0006      0.007      0.092      0.927      -0.012       0.014

Significant Predictors (p-value < 0.05):

Intercept(const): -0.0729
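- The base level of loan default probability when all predictors are zero. The value is negative, which is not a valid probability and reflects a limitation of fitting OLS to a binary outcome.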
person_age: -0.0015
- Older individuals have a slightly lower probability of default.
person_income: 3.261e-07
- Higher income is associated with a higher probability of default, though the effect size is very small.
loan_amnt: -1.379e-05
- Higher loan amounts are associated with a lower probability of default.
loan_int_rate: 0.0078
- Higher interest rates increase the probability of default.
loan_percent_income: 1.7801
- Loans that constitute a higher percentage of the individual’s income are more likely to default.
person_home_ownership_OWN: -0.1279
- Owning a home decreases the probability of default.
person_home_ownership_RENT: 0.0887
- Renting is associated with an increased probability of default.

Loan Intent:

loan_intent_EDUCATION: -0.0977
- Loans taken for education are less likely to default.
loan_intent_MEDICAL: -0.0169
- Loans taken for medical purposes are less likely to default.
loan_intent_PERSONAL: -0.0706
- Personal loans are less likely to default.
loan_intent_VENTURE: -0.1123
- Loans for ventures are less likely to default.

Loan Grade:

loan_grade_D: 0.3809
loan_grade_E: 0.4199
loan_grade_F: 0.4690
loan_grade_G: 0.7291
- Riskier loan grades (D to G) significantly increase the probability of default relative to grade A, the omitted baseline.

Interpretation:

Negative Coefficients:

Older Age: Slightly decreases the likelihood of default, likely indicating more financial stability.
Higher Loan Amounts: Surprisingly lowers the probability of default, possibly indicating that larger loans are granted to more creditworthy individuals.
Home Ownership: Owning a home reduces the likelihood of default, reflecting better financial stability.
Specific Loan Intents: Loans for education, medical, personal, and venture purposes are less likely to default than loans in the omitted baseline category, possibly because these are seen as more responsible or necessary expenditures.

Positive Coefficients:

Higher Income: Positively associated with default, although the effect size is minimal, suggesting other factors play a more significant role.
Higher Interest Rates: Increase the probability of default, likely due to higher financial burdens.
Higher Loan Percent of Income: Strongly increases the probability of default, indicating higher financial strain.
Renting: Associated with a higher probability of default, possibly reflecting less financial stability.
Riskier Loan Grades (D to G): Significantly increase default probability relative to grade A, reflecting higher risk profiles for these loans.

Kurtosis and other descriptive statistics

==============================================================================
Omnibus:                     3401.416   Durbin-Watson:                   1.829
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4901.916
Skew:                           0.911   Prob(JB):                         0.00
Kurtosis:                       3.898   Cond. No.                     2.66e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.66e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

Diagnostic Statistics:

Omnibus Test:

Omnibus: 3401.416
Prob(Omnibus): 0.000
- The omnibus test examines skewness and kurtosis of the residuals. A significant p-value (0.000) concludes that residuals are not normally distributed.

Durbin-Watson Statistic:

Durbin-Watson: 1.829
- This test statistic checks for autocorrelation in the residuals. Values around 2 indicate no autocorrelation. The current value of 1.829 suggests very little autocorrelation.

Jarque-Bera (JB) Test:

Jarque-Bera (JB): 4901.916
Prob(JB): 0.00
- The JB test also assesses normality. The significant p-value (0.00) indicates that the residuals are not normally distributed, consistent with the omnibus test.

Skewness and Kurtosis:

Skew: 0.911
- A positive skew (0.911) indicates that the right tail of the distribution is longer or fatter than the left.
Kurtosis: 3.898
- A kurtosis value of 3.898 suggests that the distribution has heavier tails than a normal distribution (leptokurtic).
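
The Jarque-Bera statistic is built directly from the skewness and kurtosis reported above, so it can be checked by hand. The sketch below applies the standard formula JB = n/6 * (S^2 + (K - 3)^2 / 4) to the rounded values from the summary; the small gap to the reported 4901.916 comes from that rounding.

```python
# Values copied from the OLS diagnostics above.
n_obs = 28501
skew = 0.911
kurtosis = 3.898   # a normal distribution has kurtosis 3

# Jarque-Bera: penalizes both asymmetry and excess kurtosis.
excess_kurtosis = kurtosis - 3
jb = n_obs / 6 * (skew ** 2 + excess_kurtosis ** 2 / 4)

print(f"JB = {jb:.1f}")
```

With more than 28,000 observations, even modest skewness produces a very large statistic, which is why both normality tests reject so decisively.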

Condition Number:

Cond. No.: 2.66e+06
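- The condition number measures how sensitive the coefficient estimates are to small changes in the data and is commonly used as a multicollinearity check. A value above roughly 30 is often treated as a warning sign. A condition number of 2.66e+06 suggests strong multicollinearity or other numerical problems, although part of it may come from the very different scales of the predictors (for example, person_income in raw currency units next to 0/1 dummy columns).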
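
A large condition number does not always mean correlated predictors; it is also inflated by columns measured on very different scales. The following sketch, on made-up data with NumPy, separates the two effects. The income and rate columns are hypothetical and independent by construction, so any large value in the first case comes from scale alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
income = rng.normal(60_000, 20_000, n)   # hypothetical raw income
rate = rng.normal(11, 3, n)              # hypothetical interest rate, percent

# Case 1: independent predictors on very different scales, plus a constant.
X_raw = np.column_stack([np.ones(n), income, rate])

# Case 2: same predictors, standardized to mean 0 and standard deviation 1.
X_std = X_raw.copy()
X_std[:, 1:] = (X_raw[:, 1:] - X_raw[:, 1:].mean(axis=0)) / X_raw[:, 1:].std(axis=0)

# Case 3: standardized predictors plus a near-duplicate column (true collinearity).
near_copy = X_std[:, 1] + rng.normal(0, 0.01, n)
X_collinear = np.column_stack([X_std, near_copy])

cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)
cond_collinear = np.linalg.cond(X_collinear)

print(f"raw scales:          {cond_raw:,.0f}")
print(f"standardized:        {cond_std:,.2f}")
print(f"near-duplicate col.: {cond_collinear:,.0f}")
```

If standardizing the real predictors brings the condition number down sharply, scale was the main driver; if it stays high, the predictors themselves are correlated.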

Notes:

The standard errors assume that the covariance matrix of the errors is correctly specified.
The large condition number indicates potential multicollinearity issues, meaning some predictors may be highly correlated, making it difficult to isolate the individual effect of each predictor.

Interpretation:

Normality of Residuals:

Both the Omnibus and Jarque-Bera tests suggest that the residuals are not normally distributed. This violation of the normality assumption can affect the validity of hypothesis tests for coefficients.

Autocorrelation:

The Durbin-Watson statistic suggests little to no autocorrelation in the residuals, which is a good sign for the validity of the model.

Multicollinearity:

The extremely high condition number indicates significant multicollinearity issues. This can inflate the variances of the coefficient estimates and make the model unstable. It suggests that some predictors may need to be removed or combined.

Skewness and Kurtosis:

The positive skew and high kurtosis indicate that the residuals have a distribution with a heavy right tail and are more peaked than a normal distribution.

Independent variables whose p < 0.05

Significant variables with p-value < 0.05:
                            Coefficient  p-value
const                           -0.0729   0.0003
person_age                      -0.0015   0.0136
person_income                    0.0000   0.0000
loan_amnt                       -0.0000   0.0000
loan_int_rate                    0.0078   0.0001
loan_percent_income              1.7801   0.0000
person_home_ownership_OWN       -0.1279   0.0000
person_home_ownership_RENT       0.0887   0.0000
loan_intent_EDUCATION           -0.0977   0.0000
loan_intent_MEDICAL             -0.0169   0.0120
loan_intent_PERSONAL            -0.0706   0.0000
loan_intent_VENTURE             -0.1123   0.0000
loan_grade_D                     0.3809   0.0000
loan_grade_E                     0.4199   0.0000
loan_grade_F                     0.4690   0.0000
loan_grade_G                     0.7291   0.0000

Receiver Operating Characteristic (ROC) area under the curve

The ROC curve shows how the models perform in terms of classification. An ROC curve plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at various threshold settings. Here are the key insights from the provided ROC curve:
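
To show what "various threshold settings" means in practice, here is a small pure-Python sketch that builds ROC points for six made-up predictions and integrates them with the trapezoid rule. The labels, scores, and function names are illustrative assumptions, not values from this study.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, starting at (0, 0)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = default) and predicted default probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

points = roc_points(y_true, scores)
roc_auc = trapezoid_auc(points)
print(points)
print(f"AUC = {roc_auc:.3f}")
```

The resulting AUC of 8/9 equals the share of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score, which is the usual ranking interpretation of ROC-AUC.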

Area Under the Curve (AUC):

Logistic Regression: AUC = 0.76
Decision Tree: AUC = 0.80
Gradient Boosting: AUC = 0.92
AdaBoost: AUC = 0.90
Random Forest: AUC = 0.92
K-Nearest Neighbors (KNN): AUC = 0.81
Neural Network: AUC = 0.75

Performance Comparison:

Gradient Boosting and Random Forest: Both models have the highest AUC (0.92), indicating they have the best performance in distinguishing between defaulters and non-defaulters.
AdaBoost: Also performs very well with an AUC of 0.90.
Decision Tree and KNN: These models have moderate performance with AUCs of 0.80 and 0.81, respectively.
Logistic Regression and Neural Network: These models have the lowest AUCs (0.76 and 0.75, respectively) among the compared models, indicating they are less effective in distinguishing between defaulters and non-defaulters compared to the other models.

True Positive Rate vs. False Positive Rate:

Gradient Boosting and Random Forest: These models consistently perform well across different thresholds, as indicated by their curves being closer to the top-left corner of the plot.
AdaBoost: Shows strong performance, with a curve that stays above the Logistic Regression, Decision Tree, and KNN curves for most of the threshold range.
Logistic Regression and Neural Network: Their curves are closer to the diagonal line (random guessing), suggesting they are less effective at distinguishing between classes.

Model Selection:

For a high-performing model, Gradient Boosting and Random Forest are the top choices, given their high AUC values and ROC curves close to the top-left corner.
AdaBoost is also a strong contender with a high AUC.
Decision Tree and KNN can be considered for their simplicity and interpretability, though they have lower AUCs.
Logistic Regression and Neural Network might be less preferable due to their lower AUCs in this context.


Key insights from Confusion matrix

Confusion matrices give a more detailed view of how well a classification model performs by showing the counts of true positives, true negatives, false positives, and false negatives. I analyzed the confusion matrix for each model.


Precision-Recall Curve

Key insights from the Precision-Recall curve

The PR curve plots precision against recall across different threshold values, which is especially useful for imbalanced datasets. The discussion below covers the PR curve of each model.
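
Each point on a PR curve corresponds to one decision threshold. The sketch below computes precision and recall at three thresholds for six made-up predictions; the data and the helper name are illustrative assumptions. When nothing is flagged, precision is set to 1.0, which is a convention I chose for the sketch.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are labeled as defaults."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Convention (assumption): precision is 1.0 when no case is flagged.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical labels (1 = default) and predicted default probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

pr_points = {}
for threshold in (0.85, 0.5, 0.35):
    pr_points[threshold] = precision_recall_at(y_true, scores, threshold)
    p, r = pr_points[threshold]
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold always raises or keeps recall, but precision can move in either direction, which is why PR curves are often jagged. A model with no skill sits near a horizontal line at the share of positives in the data, so PR-AUC values should be read against that baseline rather than against 0.5.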

Logistic Regression:

AUC: 0.53

Interpretation:

Logistic Regression has the lowest AUC among the models, so it offers the weakest trade-off between precision and recall. Precision drops steeply as recall increases, which means the model cannot catch more defaulters without sharply increasing false positives.

Decision Tree:

AUC: 0.72

Interpretation:

The Decision Tree performs better than Logistic Regression, with a clearly higher AUC. Its curve shows a more balanced trade-off between precision and recall, although precision still falls off as recall becomes large. Precision stays relatively high across the middle range of recall values.

AdaBoost:

AUC: 0.79

Interpretation:

AdaBoost performs well, with a high AUC. Precision is relatively high at moderate-to-high recall, but at lower recall the model falls short of Gradient Boosting and Random Forest.

Random Forest:

AUC: 0.86

Interpretation:

Random Forest has the highest AUC of the values listed here. Its curve stays high throughout, meaning the model maintains high precision even at higher recall levels.

K-Nearest Neighbors (KNN):

AUC: 0.65

Interpretation:

KNN is a moderate performer: its AUC is better than that of Logistic Regression but worse than those of the ensemble methods. Precision falls quickly for small gains in recall along the curve, so the model does not hold up at higher recall values.

Neural Network:

AUC: 0.55

Interpretation:

The Neural Network has a low AUC and is nowhere close to the ensemble methods in its precision-recall trade-off. Its precision is high only briefly at the start of the curve, and the model is very poor at improving recall.

Conclusion

Best Performers:

Random Forest and Gradient Boosting: Both have the highest AUC values and maintain high precision across a wide range of recall levels, making them the best performers on this dataset.

AdaBoost: Performs well, with a high AUC and a good balance between recall and precision.

Moderate performers:

Decision Tree: Reasonably balanced, but is vastly outperformed by the ensemble methods.

K-Nearest Neighbors: It is better than Logistic Regression but worse than the ensemble methods.

Weakest performers:

Logistic Regression and Neural Network: Both have the smallest AUC values and a significant drop in precision as recall goes up, meaning they balance the two poorly.

Machine Learning Models Comparison

                    Accuracy  Precision    Recall  F1 Score   ROC-AUC
Logistic Regression  0.803017   0.743772  0.165873  0.271252  0.759613
Decision Tree        0.863182   0.693548  0.682540  0.688000  0.798487
Gradient Boosting    0.906508   0.893824  0.654762  0.755841  0.919139
AdaBoost             0.887739   0.819588  0.630952  0.713004  0.900211
Random Forest        0.911770   0.907427  0.669048  0.770215  0.922457
K-Nearest Neighbors  0.833889   0.664910  0.500794  0.571299  0.810651
Neural Network       0.780039   1.000000  0.004762  0.009479  0.748078
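
The Neural Network row looks contradictory at first (perfect precision, almost no recall), and a worked example makes it easier to read. The confusion-matrix counts below are back-solved from the table, assuming a test set of 5,701 rows with 1,260 defaults; they are inferred from the reported metrics, not copied from the original confusion matrices. The helper function name is mine.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1 Score": f1}


# Counts inferred from the metrics table (assumption, see lead-in).
inferred_counts = {
    "Logistic Regression": dict(tp=209, fp=72, fn=1051, tn=4369),
    "Neural Network": dict(tp=6, fp=0, fn=1254, tn=4441),
}

recomputed = {name: classification_metrics(**counts)
              for name, counts in inferred_counts.items()}

for name, scores in recomputed.items():
    print(name, {k: round(v, 6) for k, v in scores.items()})
```

Under these inferred counts, the Neural Network flags only 6 loans as defaults. All 6 are correct, so precision is 1.0, but it misses the other 1,254 defaulters. Its accuracy of about 0.78 is close to what a model predicting "no default" for everyone would score, which shows why accuracy alone is misleading on imbalanced data.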

Conclusion

In this study, I developed a complete machine learning pipeline for predicting credit risk, with a focus on identifying the most effective model for predicting loan default. I used a well-structured dataset from Kaggle and preprocessed it by removing duplicates and missing values and converting categorical variables into dummy variables.
As a first step, I used OLS regression to select the statistically significant features. I then trained a set of machine learning models (Logistic Regression, Decision Tree, Gradient Boosting, AdaBoost, Random Forest, K-Nearest Neighbors, and a Neural Network) on 80% of the data and held out the remaining 20% for testing.

For each model, I collected accuracy, precision, recall, F1 score, and ROC-AUC. I then visualized the results with ROC curves, Precision-Recall curves, and confusion matrices.

The key insight from this analysis is that the ensemble methods, Gradient Boosting and Random Forest, are the best choices for predicting loan default. They showed the best balance between precision and recall, with ROC-AUC values of about 0.92 and recall of roughly 0.65 to 0.67. Logistic Regression and the Neural Network were considerably less effective.

These results provide a sound basis for choosing a model for credit risk prediction, supporting better risk management and financial decision-making. To improve the models, the imbalance between the minority (default) and majority (non-default) classes could be reduced by oversampling the minority class or undersampling the majority class.
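
As a rough illustration of the oversampling idea, here is a minimal NumPy sketch of random oversampling for a binary target. The function name and the toy arrays are hypothetical, and this is only one simple way to rebalance; it was not part of the pipeline above.

```python
import numpy as np


def random_oversample(X, y, random_state=0):
    """Duplicate random minority-class rows until both classes are the same size."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Sample minority rows with replacement to fill the gap.
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]


# Toy training split: 8 non-defaults, 2 defaults (hypothetical values).
X_train = np.arange(20).reshape(10, 2)
y_train = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_train, y_train)
print(np.bincount(y_bal))
```

Resampling should be applied to the training split only. The test set must keep its original class balance, otherwise the reported metrics no longer reflect real-world performance.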


Jayswal Journals
