Effective SQL Practices in Retail Banking: Data Management and Analytical Techniques

In retail banking, SQL is imperative when working on relational databases to access customers’ data and convert it into a usable form ready for analysis. SQL is primarily used for data wrangling which includes data cleaning and ensuring that the dataset makes sense. In that, it is used to remove missing values, redundant rows in a table, concatenate columns, perform necessary joins on tables (as there would be numerous tables in the database containing wealth of information), group data by field, sort data, and much more. Once the data is cleaned and ready for analytics using SQL, it can be accessed via a programming language such as Python or R, or a visualisation tool such as Tableau to perform descriptive and predictive analytics using statistics or machine learning models.

Talking about retail banking databases, a typical mid to large bank would have a Data Warehouse and a Datalake. Difference between the two is that the former contains structured data such as relational databases, tables, spreadsheets while the latter could contain structured, unstructured and semi structured data such as images, json script, and logs. Usually, the Data Warehouse provider enables SQL queries to run on its databases. One such Data Warehouse provider is Teradata, which provides cloud database and analytics-related software, products, and service.

Some fields that a retail banking database may include are customer first name, last name, customer address, credit score, products bought, collateral, guarantor, loan grade, interest rate, bank branch, bank address, customer gender, and probability of default (if automated).

Further to this, I will cover how we can access the desired dataset (new cleaned table) from the database to perform further analysis using Python.

Let me take you through the SQL code for the following in a relational database –

Remove missing values from a table

Ahead of removing the missing rows, it’s best to create a new table so that we do not alter the original table. Below is the code to create a new table called ‘clean_customers’; and let’s say original table is ‘customers’.

CREATE TABLE clean_customers (
    customer_id INT,
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    address VARCHAR(255),
    credit_score INT,
    -- Add other necessary fields
);

Next to insert all the non-null rows into clean_customers table, the following code could be used

INSERT INTO clean_customers
SELECT *
FROM customers
WHERE first_name IS NOT NULL
  AND last_name IS NOT NULL
  AND address IS NOT NULL
  AND credit_score IS NOT NULL;

Find the number of redundant rows in a table

Next on, assuming ‘customer_id’ is the unique identifier (primary key), let’s count the number of redundant rows.

SELECT customer_id, COUNT(*)
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;

Get unique rows

Another way to get unique rows from the original table ‘customers’ and store it in another table ‘clean_customers’

CREATE TABLE unique_customers AS
SELECT DISTINCT *
FROM customers;

Perform joins (for illustration purposes, I have shown how to do left join in SQL)

Let’s assume we have two tables: ‘customers’ and ‘accounts’.

‘customers’ table:

‘accounts’ table:

Left Join will return all the rows from the customers table and only those rows from accounts table whose customer_id is present in the customers table.

SELECT c.customer_id, c.first_name, c.last_name, a.account_id, a.account_type, a.balance
FROM customers c
LEFT JOIN accounts a ON c.customer_id = a.customer_id;

Result

Group the data by a field

If we need to group by bank_branch to get the count of customers per branch

SELECT bank_branch, COUNT(*) AS customer_count
FROM customers
GROUP BY bank_branch;

Sort data

Below is the code for sorting in descending order

SELECT * FROM customers
ORDER BY credit_score DESC;

Create a calculated field based on other fields

Assuming we want to calculate the total loan amount as loan-amount * interest_rate.

SELECT *, (loan_amount * interest_rate) AS total_loan_amount
FROM loans;

Accessing the dataset using Python

Assuming that desired_customers is our final table, below is the code to extract the table in Python using sqlalchemy library. One could also download the table into csv format and then extract that into a dataframe in Python.

import pandas as pd
import sqlalchemy

# Create an engine to connect to the database
engine = sqlalchemy.create_engine('database_connection_string')

# Query the cleaned data
df = pd.read_sql_query('SELECT * FROM desired_customers', engine)

# Perform further analysis
print(df.describe())

Below is the code for accessing a Teradata cloud database table as a dataframe in Python, just in case you wanted to know how we do that. Teradata uses Teradatasql library that has to be imported first.

import teradatasql
import pandas as pd

# Replace with your actual Teradata connection details
host = 'your_teradata_host'
user = 'your_username'
password = 'your_password'
database = 'your_database'

# Create a connection string
connection_string = f'{{"host":"{host}","user":"{user}","password":"{password}"}}'

# Establish connection
conn = teradatasql.connect(connection_string)

# Replace 'clean_customers' with your actual table name
query = "SELECT * FROM clean_customers"

# Load data into a DataFrame
df = pd.read_sql(query, conn)

# Display the DataFrame
print(df.head())

# Close the connection
conn.close()

Hopefully, you can hit the ground running taking the above information as a reference to start doing some hands on analysis on retail banking data.

Contact me using the below form –

No comments to show.

Jayswal Journals

Leave a ReplyCancel reply

Evaluating Banking Product Profitability with Hypothesis Testing to gauge Product’s Future Viability