Blog

Attain expertise that can facilitate growth of your organization.

city

By Kamil Jankowski | Business Intelligence Engineer

We like comments in programming. They help us understand the logic for the used definitions, names of the tickets created for new features or mark owners and contact persons for the projects. However, too many comments might negatively impact the code quality –  especially if not all of them are worth keeping. Some probably might be too personal to be published or include previous, no longer-used pieces of code. In this article, I will explain how to clean up SQL queries by using Python language and regular expressions.

What are regular expressions?

It is a well-known technique developed in theoretical computer science and formal language theory that began in the 1950s. Regular expressions are used by searching algorithms for “find” or “find and replace” operations that could be compared to CTRL + F function – but instead of using keyboard shortcuts, developers are writing the functions responsible for searching and replacing the strings directly in the code. Regex is supported in many programming languages, including Python.

Project background

International corporations hire dozens, if not hundreds, of data analysts, engineers and developers from different countries and teams who generate thousands of SQL queries for data gathering or reporting. These queries are saved in repositories or shared in emails with others. A poor code review process might cause queries to be a little rubbished, which needs to be fixed at some time to improve code reading by new employees without experience or knowledge of the project. To remove all necessary comments and keep the worthy ones, I wrote a simple and easy-to-implement Regex definition that will clean up old and every new SQL query uploaded onto the server.

Removing comments with regular expressions

I cannot share the real queries that have been used for the project (as they would add unnecessary complexity to this blog post, and include some sensitive data) but below I attached a very good representation of the queries and comments that I was struggling with.

First, I need to highlight the comments that the company did not want to remove:

--~This is the comment that we do not want to remove~

--¬Same for this one¬

They are important for data loading processes in different tools, so regular expressions must keep them but remove all non-special comments that match to below examples:

--If you see this, you need to remove it

/*This is also something what we want to remove*/

/*
Same for this block of text
*/

-- This is normal comment that must be removed --

--Same for this one –

I hope that these comment examples help you understand the company’s goal to remove not all the strings in the scripts but only those that are not mandatory for other data engineering processes. For an explanation of the work that has been done, I prepared the query that includes all already mentioned comments:

--~This query collects information about european countries~
SELECT * FROM table
/*
SELECT * FROM table2
where country = 'Spain'
*/
-- where country = 'Italy'
where country = /* data might be missing for this country */ 'Germany'
--AND city = 'Rome'
AND city = 'Berlin' -- monkey is an animal
limit 30
--¬End of query, the report will start loading from here¬
--Please do not share with others --

If we have a query prepared, we can start working on a simple regex definition that will be used for comments cleaning. Basic knowledge of regular expressions is necessary to understand how to search, find and replace strings but I highly recommend using https://regex101.com/ for quick testing. It includes some kind of explanation and live preview of written expressions that help to understand how the script is working.

import re

def sql_cleaner(query):
    query_cleaned = (re.sub(r'--[^~¬].*\n',"",query))
    query_cleaned = (re.sub(r'(?s)/\*(.*?)\*/', "", query_cleaned, re.DOTALL,  re.MULTILINE))
    return query_cleaned

x = r"""
--~This query collects information about european countries~
SELECT * FROM table
/*
SELECT * FROM table2
where country = 'Spain'
*/
-- where country = 'Italy'
where country = /* data might be missing for this country */ 'Germany'
--AND city = 'Rome'
AND city = 'Berlin' -- monkey is an animal
limit 30
--¬End of query, the report will start loading from here¬
--Please do not share with others -- 
"""

output = sql_cleaner(x)
print(output)

The cleaned output, that you can see below, is much more readable and improves the speed of the reviewing process for other team members. Since I had to use multiple conditions for the comments removal process, I decided to use two separate regular expressions. The first one removes the single-line comments that start with “--”. Then, the latter cleans the remaining multi-line comments using the special flags re.DOTALL and re.MULTILINE. The result looks as follows:

--~This query collects information about european countries~
SELECT * FROM table


where country = 'Germany'

AND city = 'Berlin'
limit 30
--¬End of query, the report will start loading from here¬

Summary

The most difficult part of this process might not be the actual preparation of the cleaning definition. Based on this example, just two lines of code will do the whole work for you and for an unlimited number of queries. However, understanding the regex syntax might be quite challenging if you see it for the first time. It works similarly to CASE WHEN or IF formulas but it is not so easy to write in a programming language, so using https://regex101.com/ will help you a lot in making the first steps in that powerful tool. I hope you can simply implement the regex presented in this article. If your case is much more complicated, the good news is that regular expressions are very popular among different programming languages, so you should easily find more useful examples and tutorials on the web. Good luck with the implementation!