In the Tableau worksheets below, I conducted a credit card fraud analysis on a United States credit card fraud dataset from Kaggle, collected over four days (June 20 to June 24) in 2020.
Data Cleaning
Since the original dataset included 555,720 rows, I drew a stratified 10% sample (55,572 rows) so that the proportions of fraud and non-fraud records were preserved.
import pandas as pd
from sklearn.model_selection import train_test_split
# Step 1: Load your dataset
data = pd.read_excel('CreditCardDatasetNew.xlsx')
# Step 2: Calculate 10% of the data
sample_size = int(len(data) * 0.1)
# Step 3: Perform stratified sampling to get 10% of the rows
sampled_data, _ = train_test_split(data, train_size=sample_size, stratify=data['is_fraud'], random_state=42)
# Step 4: Save the sampled dataset to a new Excel file
sampled_data.to_excel('sampled_dataset_10_percent.xlsx', index=False)
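A quick way to confirm that a stratified sample conserves the class proportions is to compare the fraud rate before and after sampling. The sketch below uses a small synthetic frame rather than the real file; the `amt` column and the 2% fraud rate are made-up assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption: about 2% fraud).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "amt": rng.gamma(2.0, 35.0, size=20_000).round(2),
    "is_fraud": (rng.random(20_000) < 0.02).astype(int),
})

# Same call pattern as above: an integer train_size keeps exactly that many rows.
sample_size = int(len(demo) * 0.1)
demo_sample, _ = train_test_split(
    demo, train_size=sample_size, stratify=demo["is_fraud"], random_state=42
)

# With stratification the two rates should agree almost exactly.
full_rate = demo["is_fraud"].mean()
sample_rate = demo_sample["is_fraud"].mean()
print(f"fraud rate, full data: {full_rate:.4%}")
print(f"fraud rate, sample:    {sample_rate:.4%}")
```

If the two printed rates differ noticeably, the `stratify` argument was probably dropped or pointed at the wrong column.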
The sampled data contained no duplicate rows. However, some rows were missing the ‘State’ value; I filled these in using the city to which each row pertains. The age variable also had no unreasonable values, such as ages over 100.
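These three checks can be sketched in pandas as follows. This is only an illustration on a tiny made-up frame: the column names `city`, `state`, and `age` are assumptions, and the city-to-state lookup is one possible way to fill the gaps, not necessarily how it was done here.

```python
import pandas as pd

# Tiny made-up frame; column names are assumptions for illustration.
df = pd.DataFrame({
    "city": ["Austin", "Austin", "Dallas", "Miami", "Miami"],
    "state": ["TX", None, "TX", "FL", None],
    "age": [34, 51, 28, 77, 45],
})

# 1. Redundant rows: count exact duplicates.
n_duplicates = int(df.duplicated().sum())

# 2. Missing states: build a city -> state lookup from rows that have a state,
#    then use it to fill the rows that do not.
city_to_state = (
    df.dropna(subset=["state"])
      .drop_duplicates("city")
      .set_index("city")["state"]
)
df["state"] = df["state"].fillna(df["city"].map(city_to_state))

# 3. Unreasonable ages: anything negative or above 100.
n_bad_ages = int(((df["age"] < 0) | (df["age"] > 100)).sum())

print(n_duplicates, n_bad_ages)
print(df)
```

One caveat with this approach: city names are not unique across states, so a lookup like this is only safe when each city in the data maps to a single state.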
Creating Tableau Dashboard
Once that was done, I created the following graphs:
- Heat map for the number of Credit Card Frauds by Age group & Gender
- Histogram for Credit Card Fraud amount grouped by State and Age (bin: 5)
- Credit Card Frauds grouped by Age group
- Top 20 Credit Card Frauds grouped by Profession
- Top 20 Merchants involved in Credit Card Frauds.
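The charts were built in Tableau, but the aggregation behind the first one (fraud counts by age group and gender, with five-year age bins) can be approximated in pandas for a sanity check. The frame and the column names `age` and `gender` below are made-up assumptions.

```python
import pandas as pd

# Made-up rows; column names are assumptions for illustration.
frauds = pd.DataFrame({
    "age": [23, 24, 31, 33, 34, 47, 52, 52],
    "gender": ["F", "M", "F", "F", "M", "M", "F", "M"],
    "is_fraud": [1, 1, 1, 0, 1, 1, 1, 1],
})

# Five-year bins labelled by their bounds, e.g. 23 -> "20-24".
low = (frauds["age"] // 5) * 5
frauds["age_group"] = low.astype(str) + "-" + (low + 4).astype(str)

# Count fraud rows per (age group, gender) cell, as a heat map would.
fraud_only = frauds[frauds["is_fraud"] == 1]
heat = pd.crosstab(fraud_only["age_group"], fraud_only["gender"])
print(heat)
```

The resulting table has one row per age group and one column per gender, which is the same grid a heat map colors.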
Below is the Tableau dashboard for the same. One thing to note alongside the descriptive analytics is how significant the plotted variables actually are. I conducted predictive analytics on the original dataset to gauge this. Read below for the results.
Resolving data imbalance by oversampling the minority class and undersampling the majority class
The original dataset was imbalanced: fewer than 10% of the rows had is_fraud equal to 1. This time I sampled the original data so that 50% of the rows had is_fraud equal to 1 and the other 50% had 0. Simply put, I reduced the number of class-0 samples and kept the class-1 samples intact.
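A minimal sketch of this kind of rebalancing, assuming random undersampling of the majority class down to the size of the fraud class. The synthetic data, the `amt` column, and the 5% fraud rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic imbalanced data (assumption: about 5% fraud).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "amt": rng.gamma(2.0, 35.0, size=5_000).round(2),
    "is_fraud": (rng.random(5_000) < 0.05).astype(int),
})

# Keep every fraud row; draw an equally sized random subset of non-fraud rows.
fraud = demo[demo["is_fraud"] == 1]
legit = demo[demo["is_fraud"] == 0].sample(n=len(fraud), random_state=42)

# Combine and shuffle so the classes are not stored in two blocks.
balanced = (
    pd.concat([fraud, legit])
      .sample(frac=1, random_state=42)
      .reset_index(drop=True)
)
print(balanced["is_fraud"].value_counts())
```

Undersampling discards most of the majority class, so the balanced set is much smaller than the original; that is the trade-off for getting a 50/50 split without synthesizing new fraud rows.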
                            OLS Regression Results
==============================================================================
Dep. Variable:               is_fraud   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.615
Method:                 Least Squares   F-statistic:                     13.18
Date:                Thu, 13 Jun 2024   Prob (F-statistic):               0.00
Time:                        21:10:59   Log-Likelihood:                 1064.4
No. Observations:               12011   AIC:                             1025.
Df Residuals:                   10434   BIC:                         1.268e+04
Df Model:                        1576
Covariance Type:            nonrobust
Predictive Analytics
Interpretation of the Ordinary Least Squares (OLS) results.
Are the variables plotted in the descriptive statistics significant?
Age: Coefficient: 0.0022; p-value: 0.0057
Gender: Coefficient: 0.0022; p-value: 0.0057
Amount: Coefficient: 0.0022; p-value: 0.0057
Merchant: Multiple merchants showed p<0.05. A detailed description is in the attached output.
State: City can be considered a proxy for the state variable. Many cities had p<0.05, indicating that state is also a good variable to use in the descriptive analytics.
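For readers who want to reproduce this kind of significance check, here is a hedged sketch using the statsmodels formula API on synthetic data. The column names, the data-generating rule, and the formula are all assumptions; categorical columns such as merchant are expanded into one dummy variable per level, which is why a model like this can end up with a very large number of coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data; every column name here is an assumption for illustration.
rng = np.random.default_rng(1)
n = 2_000
demo = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "amt": rng.gamma(2.0, 35.0, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "merchant": rng.choice(["m_a", "m_b", "m_c"], size=n),
})
# Made-up rule: the chance of fraud rises with the transaction amount.
p = 1 / (1 + np.exp(-(demo["amt"] - 70) / 20))
demo["is_fraud"] = (rng.random(n) < p).astype(int)

# C(...) tells the formula parser to treat a column as categorical.
model = smf.ols("is_fraud ~ age + amt + C(gender) + C(merchant)", data=demo).fit()

# Coefficients with p < 0.05 are the ones usually called significant.
significant = model.pvalues[model.pvalues < 0.05]
print(model.params.round(4))
print(significant.round(4))
```

Calling `model.summary()` prints a header block in the same layout as the OLS output shown earlier, followed by one row per coefficient.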
Machine learning models and their accuracy

Random Forest and Gradient Boosting were the best models, with the highest accuracy, precision, and recall.
ROC Curve (AUROC)

In my research, I found logistic regression to be the least effective. It may provide high accuracy but has poor recall, as depicted by the curve above. Random Forest and Gradient Boosting perform best in terms of recall.
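A comparison along these lines can be sketched with scikit-learn. The example below runs on synthetic data from `make_classification`, with default hyperparameters chosen only for illustration, so its numbers will not match the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem standing in for the fraud data.
X, y = make_classification(n_samples=2_000, n_features=10,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)               # hard labels for accuracy/precision/recall
    score = model.predict_proba(X_test)[:, 1]  # class-1 probability for AUROC
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "auroc": roc_auc_score(y_test, score),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Reporting recall next to accuracy matters for fraud data, because a model can score high accuracy while still missing many fraud cases.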
Confusion Matrix
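As a reminder of how a confusion matrix is read, here is a small sketch with hand-made labels (not the model output from this analysis). scikit-learn places true classes in rows and predicted classes in columns.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration: 1 = fraud, 0 = not fraud.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels the flattened order is: TN, FP, FN, TP.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted frauds that are real frauds
recall = tp / (tp + fn)     # share of real frauds that were caught
print(cm)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In fraud detection the false negatives (missed frauds) are usually the costly cell, which is why recall gets so much attention.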

Precision-Recall Curve
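A precision-recall curve is computed from predicted scores rather than hard labels. The sketch below uses four made-up scores to show the scikit-learn calls involved; it is not the curve from this analysis.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up true labels and predicted fraud probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (precision, recall) point; a final point with
# precision 1 and recall 0 is appended, so the arrays are one element longer.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Average precision summarizes the curve as a single number.
ap = average_precision_score(y_true, y_score)
print(precision, recall, thresholds)
print(f"average precision = {ap:.3f}")
```

Plotting `recall` on the x-axis against `precision` on the y-axis gives the curve itself.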






