TL;DR
Don’t write them.
An incident at work yesterday taught me a valuable lesson about dealing with long and complicated SQL queries. I had spent four days fine-tuning and evaluating an LLM, only to discover that there was a problem with the training data. While we expected at least a few thousand positive samples, the training dataset turned out to contain fewer than one-tenth of that. After reviewing all the pre-processing steps, I concluded that the issue lay in the final data pipeline I had written to create the training data.
In retrospect, the way the training data pipeline was created had set it up for failure. I wrote the SQL query on a Saturday around 6 p.m., rushing to get the training table ready as soon as possible so I could start fine-tuning before heading out for dinner. The query was over 500 lines long and contained complicated logic for combining labels from three different sources. In fact, it was so complicated that we kept a separate document outlining all the steps (I recorded 12 in total). Considering all this, it's surprising how confident I was in my data pipeline at the time.
After an hour of debugging with a senior engineer, we finally found the issue. There was a part of the query where I needed to join two data sources and union the labels for their overlapping rows while keeping their respective individual rows. I used a FULL OUTER JOIN for this and had something like this:
SELECT
    t1.id,
    CASE
        WHEN t1.data IS NULL THEN t2.data
        WHEN t2.data IS NULL THEN t1.data
        ELSE t1.data || t2.data
    END AS combined_data
FROM t1
FULL OUTER JOIN t2
    ON t1.id = t2.id
You might have already spotted the issue here: I took care of the fact that t1.data and t2.data could be NULL and handled them with a CASE expression. However, I completely ignored the id column, even though t1.id is also NULL for every row that exists only in t2. As a result, all the rows unique to t2 came back with a NULL id, which explains the training data loss. While this may be an easy catch in this simplified skeleton, it becomes far less obvious when id is one of a dozen fields in the table and this is line 377 of a 523-line query.
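To make the failure mode concrete, here is a small self-contained reproduction using Python's sqlite3. The table names (t1, t2) and sample rows are invented for illustration, and the FULL OUTER JOIN is emulated with a LEFT JOIN plus the unmatched rows of t2, since older SQLite versions lack FULL OUTER JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE t1 (id INTEGER, data TEXT);
    CREATE TABLE t2 (id INTEGER, data TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b');  -- ids 1 and 2
    INSERT INTO t2 VALUES (2, 'B'), (3, 'C');  -- ids 2 and 3
""")

# The buggy query selects t1.id, so the row that exists only in t2
# (id = 3) comes back with a NULL id.
buggy = cur.execute("""
    SELECT t1.id,
           CASE WHEN t1.data IS NULL THEN t2.data
                WHEN t2.data IS NULL THEN t1.data
                ELSE t1.data || t2.data END AS combined_data
    FROM t1 LEFT JOIN t2 ON t1.id = t2.id
    UNION ALL
    SELECT t1.id,
           CASE WHEN t1.data IS NULL THEN t2.data
                WHEN t2.data IS NULL THEN t1.data
                ELSE t1.data || t2.data END
    FROM t2 LEFT JOIN t1 ON t1.id = t2.id
    WHERE t1.id IS NULL
""").fetchall()
print(buggy)  # id 3 has silently become NULL

# The fix: take the id from whichever side of the join is non-NULL.
fixed = cur.execute("""
    SELECT COALESCE(t1.id, t2.id) AS id,
           CASE WHEN t1.data IS NULL THEN t2.data
                WHEN t2.data IS NULL THEN t1.data
                ELSE t1.data || t2.data END AS combined_data
    FROM t1 LEFT JOIN t2 ON t1.id = t2.id
    UNION ALL
    SELECT COALESCE(t1.id, t2.id),
           CASE WHEN t1.data IS NULL THEN t2.data
                WHEN t2.data IS NULL THEN t1.data
                ELSE t1.data || t2.data END
    FROM t2 LEFT JOIN t1 ON t1.id = t2.id
    WHERE t1.id IS NULL
""").fetchall()
print(fixed)  # all three ids (1, 2, 3) survive
```

On a few toy rows the NULL id is obvious in the output; buried among thousands of rows and a dozen columns, it only shows up if you go looking for it.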
My takeaways:
Don't trade off quality for speed and don’t mistake speed for efficiency. Efficiency is a good goal to optimize for, but keep in mind that efficiency means quality AND speed. If the quality of work is sacrificed during the pursuit of speed, it will render the work useless and lead to greater inefficiency.
Avoid writing overly long queries because they are harder to understand and debug. Break down the long query into multiple steps, and create intermediary tables for each step. If it has to be done in one SQL query, use CTEs as much as possible.
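As a sketch of what this looks like in practice (the table names, CTE names, and rows below are invented for illustration), each step becomes a named CTE whose output can be inspected or materialized on its own:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE labels_a (id INTEGER, label TEXT);
    CREATE TABLE labels_b (id INTEGER, label TEXT);
    INSERT INTO labels_a VALUES (1, 'pos'), (2, 'neg');
    INSERT INTO labels_b VALUES (2, 'neg'), (3, 'pos');
""")

# Each CTE is one small, nameable step instead of one opaque 500-line block.
rows = cur.execute("""
    WITH
    cleaned_a AS (            -- step 1: drop bad rows from source A
        SELECT id, label FROM labels_a WHERE id IS NOT NULL
    ),
    cleaned_b AS (            -- step 2: drop bad rows from source B
        SELECT id, label FROM labels_b WHERE id IS NOT NULL
    ),
    combined AS (             -- step 3: merge the two sources, deduplicated
        SELECT * FROM cleaned_a
        UNION
        SELECT * FROM cleaned_b
    )
    SELECT * FROM combined ORDER BY id
""").fetchall()
print(rows)  # [(1, 'pos'), (2, 'neg'), (3, 'pos')]
```

To sanity-check an intermediate step, you can temporarily replace the final SELECT with SELECT * FROM cleaned_a (or whichever CTE you suspect) without touching the rest of the query.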
Perform regular sanity checks. It may be tempting to blow through writing the entire query in one shot, but doing this risks sacrificing quality for speed. Instead, after each intermediate step, check that the total count, distinct count, and null count all look reasonable. Most importantly, run the simple SELECT * FROM df and look through 5-10 rows. In most cases, doing these simple checks would be enough to catch whatever issues the query has.
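For instance, all three counts can be pulled in one query (the table and column names here are hypothetical stand-ins for your intermediate table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE training_data (id INTEGER, label TEXT);
    INSERT INTO training_data VALUES
        (1, 'pos'), (2, 'neg'), (NULL, 'pos'), (2, 'neg');
""")

# One-line sanity check: total rows, distinct ids, and NULL ids.
# (In SQLite, `id IS NULL` evaluates to 0/1, so SUM counts the NULLs.)
total, distinct_ids, null_ids = cur.execute("""
    SELECT COUNT(*),
           COUNT(DISTINCT id),
           SUM(id IS NULL)
    FROM training_data
""").fetchone()
print(total, distinct_ids, null_ids)  # 4 2 1
```

Here the check immediately flags that one row out of four has a NULL id and that there are duplicate ids, exactly the kind of anomaly that would have caught the join bug above.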


