7 Data Cleaning Hacks Using SQL Every Data Science Professional Needs to Know
Data cleaning is a crucial step in the data science workflow, directly influencing the accuracy of models and the quality of insights drawn from the data. Effective data cleaning ensures that the data is accurate, consistent, and ready for analysis. This article highlights seven essential data cleaning hacks that every data science professional should master. These hacks leverage SQL functions to handle common data issues, from duplicates to inconsistent formats.

Remove Duplicates — Use the ROW_NUMBER() Window Function
Duplicate records can distort your analysis, leading to misleading results. The ROW_NUMBER() window function is a powerful tool for identifying and removing duplicates.
How It Works:
The ROW_NUMBER() function assigns a unique sequential integer to rows within a partition of the result set, ordered by a specified column. By partitioning the data based on columns that should be unique and ordering it by another column (e.g., date), you can easily identify duplicates.
WITH RankedTransactions AS (
    SELECT
        transaction_id,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, product_id, transaction_date, amount
            ORDER BY transaction_id
        ) AS rn
    FROM
        transactions
)
DELETE FROM RankedTransactions WHERE rn > 1;
This query keeps only the first row of each group of duplicates and removes the rest. Note that the partition is on the columns that define a duplicate, not on the primary key (partitioning by a unique key would mark every row as rn = 1 and delete nothing). Deleting through a CTE is SQL Server syntax; in other databases, use a DELETE with a subquery that targets the same row numbers.
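As a quick, runnable illustration, here is a minimal sketch of the same idea using Python's sqlite3 module (SQLite supports window functions from version 3.25). The table and sample data are hypothetical, and since SQLite cannot DELETE through a CTE, the duplicate rows are targeted by rowid in a subquery instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    customer_id INTEGER, product_id INTEGER,
    transaction_date TEXT, amount REAL
);
INSERT INTO transactions VALUES
    (1, 10, 100, '2024-08-14', 25.0),
    (2, 10, 100, '2024-08-14', 25.0),  -- exact duplicate of row 1
    (3, 11, 200, '2024-08-15', 40.0);
""")

# Rank rows within each duplicate group; delete everything past the first.
conn.execute("""
DELETE FROM transactions WHERE rowid IN (
    SELECT rowid FROM (
        SELECT rowid,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id, product_id, transaction_date, amount
                   ORDER BY transaction_id
               ) AS rn
        FROM transactions
    ) WHERE rn > 1
);
""")

remaining = [r[0] for r in conn.execute(
    "SELECT transaction_id FROM transactions ORDER BY transaction_id")]
print(remaining)  # [1, 3]
```

The duplicate (transaction_id 2) is removed while the first occurrence survives.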
Handling Missing Data — Replace Missing Values Using COALESCE()
Missing data is a common issue in datasets and can lead to inaccuracies in analysis if not handled properly. The COALESCE() function is a simple yet effective method for dealing with missing values.
How It Works:
COALESCE() returns the first non-null value in the list of arguments. It can be used to replace missing values with a default value, another column's value, or the result of a subquery.
SELECT
    customer_id,
    COALESCE(email, 'noemail@domain.com') AS cleaned_email,
    first_name,
    last_name
FROM
    customers;
This query replaces all NULL values in the email column with the placeholder 'noemail@domain.com'.
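The behavior is easy to verify with Python's sqlite3 module; this sketch uses a hypothetical two-row customers table with one missing email:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "ada@example.com"), (2, None)])

# COALESCE returns its first non-null argument, so NULL emails
# fall through to the placeholder string.
rows = conn.execute("""
    SELECT customer_id,
           COALESCE(email, 'noemail@domain.com') AS cleaned_email
    FROM customers
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, 'ada@example.com'), (2, 'noemail@domain.com')]
```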
Correct Data Formats — Use CAST() or CONVERT() to Standardize Formats
Inconsistent data formats, particularly with dates, can cause errors in analysis and reporting. Using CAST() or CONVERT() functions allows you to standardize data formats efficiently.
How It Works:
CAST() and CONVERT() are used to transform data from one type to another. These functions are especially useful for converting string representations of dates into actual date data types.
SELECT
    order_id,
    CAST(order_date AS DATE) AS standardized_order_date,
    customer_id,
    amount
FROM
    orders;
This query converts a string date like ‘2024-08-14’ into a proper DATE type.
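SQLite (used here for a runnable sketch via Python's sqlite3 module) has no true DATE type, so this example uses its date() function as a stand-in for CAST(order_date AS DATE); the orders table and its rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-08-14 09:15:00"), (2, "2024-08-14")])

# date() normalizes datetime strings to 'YYYY-MM-DD', trimming any time
# component -- the same standardization CAST(... AS DATE) performs on
# engines that have a real DATE type.
rows = conn.execute("""
    SELECT order_id, date(order_date) AS standardized_order_date
    FROM orders
    ORDER BY order_id
""").fetchall()
print(rows)  # [(1, '2024-08-14'), (2, '2024-08-14')]
```

Both the datetime string and the bare date collapse to the same standardized form.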
Fix Inconsistent Data — Use CASE Statements to Standardize Values
Inconsistent data entries can lead to confusion and inaccuracies, especially when they represent the same concept. The CASE statement is a flexible tool for standardizing these values.
How It Works:
A CASE statement allows you to evaluate conditions and return specific values based on the results, making it ideal for correcting inconsistencies within a column.
SELECT
    product_id,
    CASE
        WHEN category = 'Electr' THEN 'Electronics'
        WHEN category = 'Elec' THEN 'Electronics'
        WHEN category = 'Cloth' THEN 'Clothing'
        ELSE category
    END AS standardized_category,
    product_name,
    price
FROM
    products;
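A minimal sqlite3 sketch of the same standardization, with hypothetical sample categories (here the two Electronics spellings are collapsed with an IN list, a slight shorthand for the article's two WHEN branches):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, category TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "Electr"), (2, "Elec"), (3, "Cloth"), (4, "Books")])

# CASE maps each known variant to its canonical label; ELSE passes
# already-clean values through unchanged.
rows = conn.execute("""
    SELECT product_id,
           CASE
               WHEN category IN ('Electr', 'Elec') THEN 'Electronics'
               WHEN category = 'Cloth' THEN 'Clothing'
               ELSE category
           END AS standardized_category
    FROM products
    ORDER BY product_id
""").fetchall()
print([r[1] for r in rows])  # ['Electronics', 'Electronics', 'Clothing', 'Books']
```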
Remove Unwanted Characters — Use REPLACE() to Clean Strings
Unwanted characters in string data can interfere with analysis and visualization. The REPLACE() function provides a straightforward way to clean strings.
How It Works:
REPLACE() searches for a specified substring within a string and replaces it with another substring (or removes it entirely).
SELECT
    customer_id,
    REPLACE(REPLACE(phone_number, '-', ''), ' ', '') AS cleaned_phone_number,
    first_name,
    last_name
FROM
    customers;
This query removes both hyphens and spaces from phone numbers by nesting two REPLACE() calls, one per unwanted character.
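The nested REPLACE() pattern can be checked with a small sqlite3 sketch; the phone numbers below are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, phone_number TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "555-123-4567"), (2, "555 987 6543")])

# Inner REPLACE strips hyphens, outer REPLACE strips spaces,
# leaving only the digits.
rows = conn.execute("""
    SELECT customer_id,
           REPLACE(REPLACE(phone_number, '-', ''), ' ', '') AS cleaned_phone_number
    FROM customers
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, '5551234567'), (2, '5559876543')]
```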
Split & Extract Data — Use SUBSTRING_INDEX() or SPLIT_PART() for String Manipulation
Sometimes, data stored in a single column needs to be split or extracted into different parts. Functions like SUBSTRING_INDEX() (MySQL) or SPLIT_PART() (PostgreSQL) are perfect for this task.
How It Works:
SUBSTRING_INDEX() and SPLIT_PART() allow you to extract substrings based on a delimiter, making it easy to separate data into different components.
SELECT
    customer_id,
    SUBSTRING_INDEX(email, '@', -1) AS email_domain
FROM
    customers;
This query extracts the domain part of an email address by splitting the string at the ‘@’ character: the -1 argument returns everything after the last delimiter, while an argument of 1 would return the username instead.
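SQLite (used here via Python's sqlite3 for a runnable sketch) has neither SUBSTRING_INDEX() nor SPLIT_PART(), so this hypothetical example achieves the same split with substr() and instr(), which behave identically for addresses containing a single '@':

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "ada@example.com"), (2, "bob@domain.org")])

# instr() locates the '@'; substr() takes everything after it for the
# domain and everything before it for the username.
rows = conn.execute("""
    SELECT customer_id,
           substr(email, instr(email, '@') + 1) AS email_domain,
           substr(email, 1, instr(email, '@') - 1) AS email_user
    FROM customers
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, 'example.com', 'ada'), (2, 'domain.org', 'bob')]
```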
Group and Aggregate for Cleaning — Use GROUP BY with Aggregation Functions
Grouping and aggregation not only help in summarizing data but can also be used to clean it. By combining GROUP BY with aggregation functions like SUM, AVG, MIN, MAX, and COUNT, you can clean and transform data effectively.
How It Works:
Aggregating data can help eliminate duplicates, find minimum or maximum values, or calculate averages, making the data more uniform and useful for analysis.
SELECT
    product_id,
    CAST(transaction_date AS DATE) AS sales_date,
    SUM(amount) AS total_sales
FROM
    transactions
GROUP BY
    product_id,
    CAST(transaction_date AS DATE);
This query rolls transaction-level rows up to one row per product per day, summing the amounts; the aggregation collapses duplicate records and produces a uniform daily sales table.
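The daily roll-up can be sketched with sqlite3 on hypothetical sample transactions (SQLite's date() substitutes for CAST(transaction_date AS DATE), which it does not support):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions
                (product_id INTEGER, transaction_date TEXT, amount REAL)""")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", [
    (100, "2024-08-14 09:00:00", 25.0),
    (100, "2024-08-14 17:30:00", 15.0),  # same product, same day
    (200, "2024-08-15 12:00:00", 40.0),
])

# Group to one row per product per day and sum the amounts.
rows = conn.execute("""
    SELECT product_id,
           date(transaction_date) AS sales_date,
           SUM(amount) AS total_sales
    FROM transactions
    GROUP BY product_id, date(transaction_date)
    ORDER BY product_id
""").fetchall()
print(rows)  # [(100, '2024-08-14', 40.0), (200, '2024-08-15', 40.0)]
```

The two same-day transactions for product 100 collapse into a single 40.0 total.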
Conclusion
Data cleaning is the cornerstone of any successful data science project. Without clean data, even the most sophisticated models and analyses can yield inaccurate or misleading results. The seven hacks discussed in this article provide powerful techniques to tackle common data quality issues, from removing duplicates to standardizing formats and handling missing values. By integrating these strategies into your data cleaning workflow, you can transform messy, unreliable datasets into pristine, analysis-ready data.
These hacks are not just about fixing errors; they are about instilling a level of discipline and rigor in your data preparation process. Whether you’re a seasoned data scientist or just starting your journey, mastering these data cleaning techniques will significantly enhance the quality of your work, leading to more robust insights and better decision-making. As the saying goes, “Garbage in, garbage out” — with these hacks, you can ensure that what goes into your models is nothing short of gold.