7 Data Cleaning Hacks Using SQL Every Data Science Professional Needs to Know
Data cleaning is a crucial step in the data science workflow, directly influencing the accuracy of models and the quality of insights drawn from the data. Effective data cleaning ensures that the data is accurate, consistent, and ready for analysis. This article highlights seven essential data cleaning hacks that every data science professional should master. These hacks leverage SQL functions to handle common data issues, from duplicates to inconsistent formats.

Remove Duplicates — Use the ROW_NUMBER() Window Function
Duplicate records can distort your analysis, leading to misleading results. The ROW_NUMBER() window function is a powerful tool for identifying and removing duplicates.
How It Works:
The ROW_NUMBER() function assigns a unique sequential integer to rows within a partition of the result set, ordered by a specified column. By partitioning the data based on columns that should be unique and ordering it by another column (e.g., date), you can easily identify duplicates.
WITH RankedTransactions AS (
    SELECT
        transaction_id,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, product_id, transaction_date, amount
            ORDER BY transaction_id
        ) AS rn
    FROM
        transactions
)
DELETE FROM RankedTransactions WHERE rn > 1;
This query keeps only the first row of each group of duplicates and removes the rest. Note that the partition is on the columns that define a duplicate, not on the primary key (partitioning by a unique key would mark every row as rn = 1 and delete nothing). Deleting through a CTE is SQL Server syntax; in other databases, use a DELETE with a subquery that targets the same row numbers.
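As a quick, runnable illustration, here is a minimal sketch of the same idea using Python's sqlite3 module (SQLite supports window functions from version 3.25). The table and sample data are hypothetical, and since SQLite cannot DELETE through a CTE, the duplicate rows are targeted by rowid in a subquery instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    customer_id INTEGER, product_id INTEGER,
    transaction_date TEXT, amount REAL
);
INSERT INTO transactions VALUES
    (1, 10, 100, '2024-08-14', 25.0),
    (2, 10, 100, '2024-08-14', 25.0),  -- exact duplicate of row 1
    (3, 11, 200, '2024-08-15', 40.0);
""")

# Rank rows within each duplicate group; delete everything past the first.
conn.execute("""
DELETE FROM transactions WHERE rowid IN (
    SELECT rowid FROM (
        SELECT rowid,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id, product_id, transaction_date, amount
                   ORDER BY transaction_id
               ) AS rn
        FROM transactions
    ) WHERE rn > 1
);
""")

remaining = [r[0] for r in conn.execute(
    "SELECT transaction_id FROM transactions ORDER BY transaction_id")]
print(remaining)  # [1, 3]
```

The duplicate (transaction_id 2) is removed while the first occurrence survives.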
Handling Missing Data — Replace Missing Values Using COALESCE()
Missing data is a common issue in datasets and can lead to inaccuracies in analysis if not handled properly. The COALESCE() function is a simple yet effective method for dealing with missing values.
How It Works:
COALESCE() returns the first non-null value in the list of arguments. It can be used to replace missing values with a default value, another column's value, or the result of a subquery.
SELECT
    customer_id,
    COALESCE(email, 'noemail@domain.com') AS cleaned_email,
    first_name,
    last_name
FROM
    customers;
This query replaces all NULL values in the email column with the placeholder 'noemail@domain.com'.
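The behavior is easy to verify with Python's sqlite3 module; this sketch uses a hypothetical two-row customers table with one missing email:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "ada@example.com"), (2, None)])

# COALESCE returns its first non-null argument, so NULL emails
# fall through to the placeholder string.
rows = conn.execute("""
    SELECT customer_id,
           COALESCE(email, 'noemail@domain.com') AS cleaned_email
    FROM customers
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, 'ada@example.com'), (2, 'noemail@domain.com')]
```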
Correct Data Formats — Use CAST() or CONVERT() to Standardize Formats
Inconsistent data formats, particularly with dates, can cause errors in analysis and reporting. Using CAST() or CONVERT() functions allows you to standardize data formats efficiently.
How It Works:
CAST() and CONVERT() are used to transform data from one type to another. These functions are especially useful for converting string representations of dates into actual date data types.
SELECT
    order_id,
    CAST(order_date AS DATE) AS standardized_order_date,
    customer_id,
    amount
FROM
    orders;
This query converts a string date like ‘2024-08-14’ into a proper DATE type.
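SQLite (used here for a runnable sketch via Python's sqlite3 module) has no true DATE type, so this example uses its date() function as a stand-in for CAST(order_date AS DATE); the orders table and its rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-08-14 09:15:00"), (2, "2024-08-14")])

# date() normalizes datetime strings to 'YYYY-MM-DD', trimming any time
# component -- the same standardization CAST(... AS DATE) performs on
# engines that have a real DATE type.
rows = conn.execute("""
    SELECT order_id, date(order_date) AS standardized_order_date
    FROM orders
    ORDER BY order_id
""").fetchall()
print(rows)  # [(1, '2024-08-14'), (2, '2024-08-14')]
```

Both the datetime string and the bare date collapse to the same standardized form.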
Fix Inconsistent Data — Use CASE Statements to Standardize Values
Inconsistent data entries can lead to confusion and inaccuracies, especially when they represent the same concept. The CASE statement is a flexible tool for standardizing these values.
How It Works:
A CASE statement allows you to evaluate conditions and return specific values based on the results, making it ideal for correcting inconsistencies within a column.
SELECT
    product_id,
    CASE
        WHEN category = 'Electr' THEN 'Electronics'
        WHEN category = 'Elec' THEN 'Electronics'
        WHEN category = 'Cloth' THEN 'Clothing'
        ELSE category
    END AS standardized_category,
    product_name,
    price
FROM
    products;
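A minimal sqlite3 sketch of the same standardization, with hypothetical sample categories (here the two Electronics spellings are collapsed with an IN list, a slight shorthand for the article's two WHEN branches):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, category TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "Electr"), (2, "Elec"), (3, "Cloth"), (4, "Books")])

# CASE maps each known variant to its canonical label; ELSE passes
# already-clean values through unchanged.
rows = conn.execute("""
    SELECT product_id,
           CASE
               WHEN category IN ('Electr', 'Elec') THEN 'Electronics'
               WHEN category = 'Cloth' THEN 'Clothing'
               ELSE category
           END AS standardized_category
    FROM products
    ORDER BY product_id
""").fetchall()
print([r[1] for r in rows])  # ['Electronics', 'Electronics', 'Clothing', 'Books']
```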
Remove Unwanted Characters — Use REPLACE() to Clean Strings
Unwanted characters in string data can interfere with analysis and visualization. The REPLACE() function provides a straightforward way to clean strings.
How It Works:
REPLACE() searches for a specified substring within a string and replaces it with another substring (or removes it entirely).
SELECT
    customer_id,
    REPLACE(REPLACE(phone_number, '-', ''), ' ', '') AS cleaned_phone_number,
    first_name,
    last_name
FROM
    customers;
This query removes both hyphens and spaces from phone numbers by nesting two REPLACE() calls, one per unwanted character.
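The nested REPLACE() pattern can be checked with a small sqlite3 sketch; the phone numbers below are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, phone_number TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "555-123-4567"), (2, "555 987 6543")])

# Inner REPLACE strips hyphens, outer REPLACE strips spaces,
# leaving only the digits.
rows = conn.execute("""
    SELECT customer_id,
           REPLACE(REPLACE(phone_number, '-', ''), ' ', '') AS cleaned_phone_number
    FROM customers
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, '5551234567'), (2, '5559876543')]
```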
Split & Extract Data — Use SUBSTRING_INDEX() or SPLIT_PART() for String Manipulation
Sometimes, data stored in a single column needs to be split or extracted into different parts. Functions like SUBSTRING_INDEX() (MySQL) or SPLIT_PART() (PostgreSQL) are perfect for this task.
How It Works:
SUBSTRING_INDEX() and SPLIT_PART() allow you to extract substrings based on a delimiter, making it easy to separate data into different components.
SELECT
    customer_id,
    SUBSTRING_INDEX(email, '@', -1) AS email_domain
FROM
    customers;
This query extracts the domain part of an email address by splitting the string at the ‘@’ character: the -1 argument returns everything after the last delimiter, while an argument of 1 would return the username instead.
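SQLite (used here via Python's sqlite3 for a runnable sketch) has neither SUBSTRING_INDEX() nor SPLIT_PART(), so this hypothetical example achieves the same split with substr() and instr(), which behave identically for addresses containing a single '@':

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "ada@example.com"), (2, "bob@domain.org")])

# instr() locates the '@'; substr() takes everything after it for the
# domain and everything before it for the username.
rows = conn.execute("""
    SELECT customer_id,
           substr(email, instr(email, '@') + 1) AS email_domain,
           substr(email, 1, instr(email, '@') - 1) AS email_user
    FROM customers
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, 'example.com', 'ada'), (2, 'domain.org', 'bob')]
```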
Group and Aggregate for Cleaning — Use GROUP BY with Aggregation Functions
Grouping and aggregation not only help in summarizing data but can also be used to clean it. By combining GROUP BY with aggregation functions like SUM, AVG, MIN, MAX, and COUNT, you can clean and transform data effectively.
How It Works:
Aggregating data can help eliminate duplicates, find minimum or maximum values, or calculate averages, making the data more uniform and useful for analysis.
SELECT
    product_id,
    CAST(transaction_date AS DATE) AS sales_date,
    SUM(amount) AS total_sales
FROM
    transactions
GROUP BY
    product_id,
    CAST(transaction_date AS DATE);
This query rolls transaction-level rows up to one row per product per day, summing the amounts; the aggregation collapses duplicate records and produces a uniform daily sales table.
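The daily roll-up can be sketched with sqlite3 on hypothetical sample transactions (SQLite's date() substitutes for CAST(transaction_date AS DATE), which it does not support):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions
                (product_id INTEGER, transaction_date TEXT, amount REAL)""")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", [
    (100, "2024-08-14 09:00:00", 25.0),
    (100, "2024-08-14 17:30:00", 15.0),  # same product, same day
    (200, "2024-08-15 12:00:00", 40.0),
])

# Group to one row per product per day and sum the amounts.
rows = conn.execute("""
    SELECT product_id,
           date(transaction_date) AS sales_date,
           SUM(amount) AS total_sales
    FROM transactions
    GROUP BY product_id, date(transaction_date)
    ORDER BY product_id
""").fetchall()
print(rows)  # [(100, '2024-08-14', 40.0), (200, '2024-08-15', 40.0)]
```

The two same-day transactions for product 100 collapse into a single 40.0 total.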
Conclusion
Data cleaning is the cornerstone of any successful data science project. Without clean data, even the most sophisticated models and analyses can yield inaccurate or misleading results. The seven hacks discussed in this article provide powerful techniques to tackle common data quality issues, from removing duplicates to standardizing formats and handling missing values. By integrating these strategies into your data cleaning workflow, you can transform messy, unreliable datasets into pristine, analysis-ready data.
These hacks are not just about fixing errors; they are about instilling a level of discipline and rigor in your data preparation process. Whether you’re a seasoned data scientist or just starting your journey, mastering these data cleaning techniques will significantly enhance the quality of your work, leading to more robust insights and better decision-making. As the saying goes, “Garbage in, garbage out” — with these hacks, you can ensure that what goes into your models is nothing short of gold.