Credit Card Frauds’ Descriptive and Predictive Analytics using ‘Tableau’ and ‘Python’

Jayswal Journals

Credit Analysis, Credit Risk, Data Analytics, Machine Learning, Python, Tableau

Shubham Jayswal

In the Tableau worksheets below, I analyze credit card fraud using a United States credit card fraud dataset from Kaggle, collected over four days (June 20 to June 24) in 2020.

Data Cleaning

Since the original dataset included 555,720 rows, I drew a stratified 10% sample (55,572 rows), ensuring that the proportions of fraud and non-fraud transactions were preserved.

import pandas as pd
from sklearn.model_selection import train_test_split

# Step 1: Load your dataset
data = pd.read_excel('CreditCardDatasetNew.xlsx')

# Step 2: Calculate 10% of the data
sample_size = int(len(data) * 0.1)

# Step 3: Perform stratified sampling to get 10% of the rows
sampled_data, _ = train_test_split(data, train_size=sample_size, stratify=data['is_fraud'], random_state=42)

# Step 4: Save the sampled dataset to a new Excel file
sampled_data.to_excel('sampled_dataset_10_percent.xlsx', index=False)

The sampled data contained no redundant rows. However, some rows were missing the ‘State’ value; I filled these in using the city to which each row pertains. The age variable also contained no unreasonable values, such as ages above 100.
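The cleaning checks described above can be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not the exact steps used for this post; the column names 'city', 'state' and 'age' are assumptions.

```python
import pandas as pd

# Toy stand-in for the sampled data; the column names 'city', 'state'
# and 'age' are assumptions made for illustration.
df = pd.DataFrame({
    "city":  ["Austin", "Austin", "Miami", "Miami", "Reno"],
    "state": ["TX", None, "FL", None, "NV"],
    "age":   [34, 51, 27, 88, 45],
})

# 1. Redundant rows: count exact duplicates.
n_duplicates = int(df.duplicated().sum())

# 2. Missing 'state': build a city -> state lookup from rows where the
#    state is known, then use it to fill the gaps.
city_to_state = (
    df.dropna(subset=["state"])
      .drop_duplicates(subset="city")
      .set_index("city")["state"]
)
df["state"] = df["state"].fillna(df["city"].map(city_to_state))

# 3. Unreasonable ages: count values above 100.
n_unreasonable_ages = int((df["age"] > 100).sum())

print(n_duplicates, n_unreasonable_ages)
print(df)
```

One caveat with this kind of lookup: a city name that exists in more than one state is ambiguous, so such rows would need a manual check.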

Creating Tableau Dashboard

Once that was done, I created the following graphs:

Heat map for the number of Credit Card Frauds by Age group & Gender
Histogram for Credit Card Fraud amount grouped by State and Age (bin: 5)
Credit Card Frauds grouped by Age group
Top 20 Credit Card Frauds grouped by Profession
Top 20 Merchants involved in Credit Card Frauds.
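The charts themselves were built in Tableau, but the aggregation behind the first one (fraud counts by age group and gender, with 5-year age bins) can be cross-checked in pandas. The sketch below uses a toy DataFrame; the column names 'age' and 'gender' and the bin edges are assumptions for illustration.

```python
import pandas as pd

# Toy data; 'age' and 'gender' are assumed column names.
df = pd.DataFrame({
    "age":      [23, 27, 31, 34, 36, 58, 59, 62],
    "gender":   ["F", "M", "F", "F", "M", "M", "F", "M"],
    "is_fraud": [1, 0, 1, 1, 0, 1, 1, 0],
})

# 5-year age bins: 20-24, 25-29, ..., 60-64 (left edge inclusive).
edges = list(range(20, 70, 5))
labels = [f"{lo}-{lo + 4}" for lo in edges[:-1]]
df["age_group"] = pd.cut(df["age"], bins=edges, labels=labels, right=False)

# Count fraud rows per (age group, gender); this is the grid a heat map shows.
fraud_counts = (
    df[df["is_fraud"] == 1]
      .groupby(["age_group", "gender"], observed=True)
      .size()
      .unstack(fill_value=0)
)

print(fraud_counts)
```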

Below is the Tableau dashboard for these graphs. One question to keep in mind alongside the descriptive analytics is how significant the plotted variables actually are. To gauge this, I conducted predictive analytics on the original dataset; the results follow.

Resolving data imbalance by undersampling the majority class

In the original dataset, fewer than 10% of rows had is_fraud equal to 1, which made it imbalanced. This time I sampled the original data so that 50% of the rows had is_fraud equal to 1 and the other 50% had 0. Simply put, I reduced the number of class 0 samples and kept the 1’s intact.
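A 50/50 split of this kind can be produced with plain pandas. The sketch below is one way to do it on toy data, assuming only the is_fraud column from the dataset; the 'amt' column and the random seed are placeholders.

```python
import pandas as pd

# Toy imbalanced data: 5 fraud rows out of 100. 'amt' is a placeholder column.
df = pd.DataFrame({
    "amt": range(100),
    "is_fraud": [1] * 5 + [0] * 95,
})

fraud = df[df["is_fraud"] == 1]
non_fraud = df[df["is_fraud"] == 0]

# Keep every fraud row and randomly draw an equal number of non-fraud rows.
non_fraud_down = non_fraud.sample(n=len(fraud), random_state=42)

# Combine the two halves and shuffle the row order.
balanced = (
    pd.concat([fraud, non_fraud_down])
      .sample(frac=1, random_state=42)
      .reset_index(drop=True)
)

print(balanced["is_fraud"].value_counts())
```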

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               is_fraud   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.615
Method:                 Least Squares   F-statistic:                     13.18
Date:                Thu, 13 Jun 2024   Prob (F-statistic):               0.00
Time:                        21:10:59   Log-Likelihood:                 1064.4
No. Observations:               12011   AIC:                             1025.
Df Residuals:                   10434   BIC:                         1.268e+04
Df Model:                        1576                                         
Covariance Type:            nonrobust

Predictive Analytics

Interpretation of the Ordinary Least Squares (OLS) Regression

Are the variables plotted in the descriptive statistics significant?

Age: Coefficient: 0.0022; p-value: 0.0057

Gender: Coefficient: 0.0022; p-value: 0.0057

Amount: Coefficient: 0.0022; p-value: 0.0057

Merchant: Multiple merchants showed p<0.05. A detailed description is in the attached output.

State: City can be considered a proxy for the state variable. Many cities had p<0.05, indicating that state is also a good variable to use in the descriptive analytics.
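For readers who want to run a similar significance check, the sketch below fits an OLS model with the statsmodels formula API and lists the terms with p<0.05. It runs on synthetic data, so the numbers have no relation to the results above; the column names 'age', 'gender' and 'amt' and the way the label is generated are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data; column names other than 'is_fraud' are assumptions.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "amt": rng.gamma(shape=2.0, scale=50.0, size=n),
})

# Synthetic label: larger amounts are made more likely to be fraud.
prob = 1 / (1 + np.exp(-(df["amt"] - 100) / 30))
df["is_fraud"] = (rng.random(n) < prob).astype(int)

# C(gender) expands the categorical column into dummy variables.
model = smf.ols("is_fraud ~ age + amt + C(gender)", data=df).fit()

# Collect coefficients and p-values, then keep the terms with p < 0.05.
summary_table = pd.DataFrame({"coef": model.params, "p_value": model.pvalues})
significant = summary_table[summary_table["p_value"] < 0.05]

print(significant)
```

Categorical columns with many levels, such as merchant or city, each contribute one dummy variable per level, which is how a model can end up with well over a thousand terms.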

Machine learning models and their accuracy

Random Forest and Gradient Boosting were the best models, with the highest accuracy, precision and recall.

AUROC Curve

In my research, I found logistic regression to be the least effective. It may provide high accuracy but shows poor recall, as depicted by the curve above. Random Forest and Gradient Boosting perform best in terms of recall.
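A comparison of this kind can be set up in scikit-learn as sketched below. It uses synthetic imbalanced data from make_classification rather than the credit card dataset, and the hyperparameters are library defaults or arbitrary choices, so the printed metrics will not match the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 10% positive class, mimicking an imbalanced label.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard labels for accuracy/precision/recall
    y_score = model.predict_proba(X_test)[:, 1]  # fraud-class probability for ROC AUC
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_score),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Reporting recall next to accuracy matters on imbalanced data, because a model can score high accuracy while still missing most of the fraud cases.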