Table of Contents
- Beyond Simple SQL: Capturing Complex Business Logic
- Our Solution: A Better Way to Combine Different Methods
- Testing Our Ideas With Synthetic Data
- Key Findings and Analysis
- Improving Recall With Better Prompts
- How Well Each Prompt Worked
- What We Learned
- Saving Time and Money
- What This Means for Other Tasks
Today, text-to-SQL aims to make it easier for people to work with databases by letting them use everyday language to ask questions about complex data. However, many current methods don't work well because they focus too much on just creating SQL code.
In this post, I won't dive into optimizing SQL generation itself. Instead, I want to share a technique I've found useful for improving how we approach many prompting problems, with text-to-SQL as the test bed: using synthetic summaries to enhance few-shot learning.
Few-shot learning is a method where we provide a system with a small number of examples to help it understand and perform a task. In our case, we're using it to help match user questions to relevant SQL queries. The key idea we’ll explore is the use of synthetic summaries of SQL queries to improve this matching process.
Here's the basic idea:
- We take existing SQL queries and create detailed summaries of what they do.
- We use these summaries, rather than the raw SQL, when trying to match user questions to relevant queries.
- We can generate more of these summaries as needed, allowing us to improve our system continually.
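To make that concrete, here is a minimal sketch of the offline half of the idea: summarize each stored query, embed the summary, and keep the original SQL alongside it. The function names (llm_summarize, embed) are placeholders for whatever LLM and embedding calls you use, not a specific library.

# Sketch of the indexing step, assuming you supply your own LLM and embedding functions.
def build_summary_index(sql_snippets, llm_summarize, embed):
    index = []
    for sql in sql_snippets:
        summary = llm_summarize(sql)  # synthetic, detailed description of what the query does
        index.append({
            "sql": sql,                   # what we eventually show the model as a few-shot example
            "summary": summary,           # what we match user questions against
            "embedding": embed(summary),  # vector used for retrieval
        })
    return index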
From my tests, this approach has shown some promising results:
- It improved accuracy in finding the right information from 81 % to 90 %.
- It maintained good performance even when we added irrelevant data.
- It seemed better at capturing business-specific logic than traditional methods.
One of the big advantages I've found with this method is its simplicity. When we find situations where the system struggles, we can often improve performance just by adding more example summaries. This is usually easier and more effective than trying to fine-tune complex AI models or write intricate rules.
In the rest of this post, I'll walk through how I developed this approach, the challenges I faced, and the solutions I came up with. My hope is that by sharing this experience, I can provide useful insights for others working on similar problems, even if they're not specifically focused on SQL generation.
Remember, the key takeaway here isn't about SQL itself but how we can use synthetic data and summaries to develop LLM-based AI systems. This approach could potentially be applied to many other areas where we're trying to match natural language queries to specific technical information.
Beyond Simple SQL: Capturing Complex Business Logic
In my years as a data scientist, I've encountered a recurring theme: what seems like a straightforward database query often isn't in the real business world. Let's explore this idea through a common yet deceptively complex request: "Show me the month-over-month growth."
Example: Month-over-month growth
A junior engineer might approach this task with a seemingly reasonable query:
SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS monthly_revenue,
    LAG(SUM(total_amount)) OVER (ORDER BY DATE_TRUNC('month', order_date)) AS previous_month_revenue,
    (SUM(total_amount) - LAG(SUM(total_amount)) OVER (ORDER BY DATE_TRUNC('month', order_date))) /
        LAG(SUM(total_amount)) OVER (ORDER BY DATE_TRUNC('month', order_date)) * 100 AS month_over_month_growth
FROM
    orders
GROUP BY
    DATE_TRUNC('month', order_date)
ORDER BY
    month;
This query compares each calendar month's total revenue to the previous month. However, in many businesses, especially retail, this approach falls short of capturing the nuances required for accurate financial reporting.
Let's look at a more sophisticated version that aligns with how many finance teams actually calculate this metric:
WITH daily_revenue AS (
    SELECT
        DATE(order_date) AS date,
        SUM(CASE
                WHEN order_status IN ('completed', 'shipped')
                    AND payment_status = 'paid'
                    AND return_date IS NULL
                THEN total_amount
                ELSE 0
            END) AS revenue
    FROM
        orders
        LEFT JOIN returns ON orders.order_id = returns.order_id
    GROUP BY
        DATE(order_date)
),
rolling_28_day AS (
    SELECT
        date,
        SUM(revenue) OVER (
            ORDER BY date
            ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
        ) AS revenue_28_day
    FROM
        daily_revenue
),
with_previous AS (
    -- Window functions can't be nested, so the 28-day lag is taken in a separate step.
    SELECT
        date,
        revenue_28_day,
        LAG(revenue_28_day, 28) OVER (ORDER BY date) AS prev_revenue_28_day
    FROM
        rolling_28_day
),
fiscal_periods AS (
    SELECT
        date,
        revenue_28_day,
        prev_revenue_28_day,
        FLOOR((EXTRACT(DOY FROM date) - 1) / 28) + 1 AS fiscal_period
    FROM
        with_previous
)
SELECT
    fiscal_period,
    MAX(date) AS period_end_date,
    SUM(revenue_28_day) AS period_revenue,
    SUM(prev_revenue_28_day) AS prev_period_revenue,
    CASE
        WHEN SUM(prev_revenue_28_day) > 0 THEN
            (SUM(revenue_28_day) - SUM(prev_revenue_28_day)) / SUM(prev_revenue_28_day) * 100
        ELSE
            NULL
    END AS growth_rate
FROM
    fiscal_periods
GROUP BY
    fiscal_period
ORDER BY
    fiscal_period DESC;
This version incorporates several considerations:
- Revenue recognition : Only completed or shipped orders that have been paid and not returned are counted as revenue.
- Fiscal periods : Instead of calendar months, it uses 28-day fiscal periods, ensuring each period has the same number of each day of the week (mirrored in the sketch after this list).
- Rolling window : A 28-day rolling window is applied within these fiscal periods.
- Period aggregation : It aggregates the entire period's revenue and compares it to the previous period.
- Flexible period end : The end of each fiscal period is dynamically determined.
- Growth rate calculation : The growth rate is calculated using the total revenue for each fiscal period.
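As a quick sanity check on the fiscal-period bucketing, here is the same day-of-year arithmetic as a small Python sketch (my own illustration of the FLOOR((DOY - 1) / 28) + 1 expression, not part of the original query):

from datetime import date

def fiscal_period(d: date) -> int:
    """Mirror FLOOR((EXTRACT(DOY FROM date) - 1) / 28) + 1 from the query above."""
    day_of_year = d.timetuple().tm_yday   # 1-based, like EXTRACT(DOY ...)
    return (day_of_year - 1) // 28 + 1

print(fiscal_period(date(2024, 1, 28)))   # 1 -- last day of the first 28-day period
print(fiscal_period(date(2024, 1, 29)))   # 2 -- first day of the second period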
Why business logic matters
This "month-over-month" example is just the tip of the iceberg. Nearly every seemingly simple business metric can have hidden complexities:
- Fiscal years : A company's fiscal year might not align with the calendar year, affecting all year-to-date calculations.
- Customer segmentation : The definition of a "high-value customer" can vary wildly between businesses and even between departments within the same company.
- Inventory turnover : This calculation might differ between industries or even product lines within the same company.
- Sales attribution : In businesses with complex sales cycles, attributing a sale to a particular marketing channel or salesperson often involves intricate logic.
The power of a SQL repository
Given these complexities, maintaining a repository of well-established SQL queries becomes crucial. This repository serves several key functions:
- Consistency : It ensures that everyone in the organization is using the same definitions and calculations for key metrics.
- Knowledge transfer : New team members can quickly understand how the company calculates various metrics.
- Automation foundation : These queries can serve as the basis for automated reporting systems.
- Compliance : In regulated industries, having a centralized repository of approved calculations can aid in audit processes.
When applied to a text-to-SQL system, this repository becomes even more powerful. Instead of generating generic SQL, the system can pull from a library of queries that already embed the company's specific business logic and preferred calculation methods.
This kind of example repository also goes beyond SQL: you can imagine pulling from similar libraries of examples to draft emails, write code, extract data, and more.
The key point for this application is that while a repository of SQL code is powerful on its own, synthetic summaries can meaningfully improve the recall of the retrieval system built on top of it.
Our Solution: A Better Way to Combine Different Methods
To make our system work better, we use a mix of different tools. Here's how it works:
- Synthetic summaries : We use AI to create detailed summaries of tables and SQL code. These summaries help us understand the information better than just looking at keywords.
- Retrieval augmented in-context learning : Our system employs a two-step process to enhance SQL generation:
- We first identify relevant tables and snippets.
- The retrieved tables and code examples are used to provide context-rich instructions to our large language model (LLM), enabling it to better understand and apply specific business rules when generating SQL queries.
def create_sql_answer(question):
    # Start with a core set of tables and examples, then add anything retrieved for this question.
    tables = BASIC_TABLES + find_related_tables(question)
    code_examples = BASIC_EXAMPLES + find_related_examples(question)
    # The LLM generates the SQL using the retrieved tables and code examples as context.
    final_answer = ai.generate_answer(tables, code_examples, question)
    return final_answer
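The retrieval helpers are where the summaries come in. Here is a rough sketch of what find_related_examples could look like over the summary index sketched earlier; a production system would use a vector database (e.g., pgvector) instead of a linear scan, and the extra index/embed arguments are an assumption of this sketch:

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_related_examples(question, index, embed, k=5):
    """Rank stored SQL snippets by how well their *summaries* match the question."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item["embedding"]), reverse=True)
    return [item["sql"] for item in ranked[:k]]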
Here's a simplified example of how we prompt our LLM for SQL generation:
You are an SQL analyst tasked with writing SQL queries to answer specific questions about data stored in a database. Your goal is to create an accurate and efficient SQL query that answers the given question.
Here is the question you need to answer:
<question>
{{QUESTION}}
</question>
To help you write the SQL query, here are some relevant tables you can use:
<tables>
{{TABLES}}
</tables>
Additionally, here are some relevant example snippets that may be helpful:
<examples>
{{EXAMPLES}}
</examples>
Follow these steps to create your SQL query:
1. Carefully analyze the question to understand what data is being requested.
2. Identify the relevant tables and columns needed to answer the question.
3. Determine any necessary joins, filters, aggregations, or other SQL operations required to produce the desired result.
4. Write a clear and efficient SQL query that answers the question.
5. After writing the query, briefly explain your approach and any key decisions you made in constructing the query.
Please provide your response in the following format:
<sql_query>
[Your SQL query here]
</sql_query>
<explanation>
[Your explanation of the query and approach here]
</explanation>
Remember to use proper SQL syntax and best practices when writing your query. If you need any clarification or additional information to answer the question, please state so in your explanation.
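Assembling that prompt from the retrieved context is plain string substitution. A small sketch of how the pieces could be stitched together (the template constant and joining logic are illustrative, not the exact implementation):

SQL_GENERATION_TEMPLATE = """You are an SQL analyst tasked with writing SQL queries...

<question>
{question}
</question>

<tables>
{tables}
</tables>

<examples>
{examples}
</examples>
"""

def build_generation_prompt(question, tables, code_examples):
    # Retrieved table definitions and SQL snippets are concatenated into the template
    # so the model sees the company's own business logic as few-shot context.
    return SQL_GENERATION_TEMPLATE.format(
        question=question,
        tables="\n\n".join(tables),
        examples="\n\n".join(code_examples),
    )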
Testing Our Ideas With Synthetic Data
Now that we had our idea, we needed to test it out. We used language models to generate synthetic data in an attempt to build a retrieval test set. Our goal was to measure how well our prompts performed compared to simply retrieving snippets. However, we discovered that without augmenting a real-world dataset, our tests were too easy. We needed additional data augmentation and labeling to produce a test suite that was challenging enough for us to iterate on our summarization prompts effectively.
To evaluate our approach of using synthetic summaries for few-shot learning in text-to-SQL tasks, we needed a testing framework. Here's how we developed and refined our testing process:
Initial attempts with simple synthetic data
We started by creating our own simple queries and corresponding questions.
SELECT product_id, COUNT(impression_id) as impression_count
FROM impressions
WHERE product_id = $1 AND deleted_at IS NULL
GROUP BY product_id;
Question : "How can I find out how many impressions a particular product has received, excluding any that have been deleted?"
However, this approach proved too simplistic. Our system achieved suspiciously high scores (recall@5 of 0.95), indicating our test set wasn't challenging enough.
Leveraging complex open-source datasets
To create a more realistic test environment, we turned to the BirdSQL dataset, which contains over 1,500 queries working with 95 separate tables. This provided the complexity we needed to mirror real-world scenarios.
Example query from BirdSQL:
SELECT
(CAST(SUM(T1.id) AS REAL) / COUNT(T1.id)) / 4,
T2.language
FROM
sets AS T1
INNER JOIN
set_translations AS T2 ON T1.id = T2.id
WHERE
T1.releaseDate BETWEEN '2012-01-01' AND '2015-12-31'
GROUP BY
T1.releaseDate
ORDER BY
COUNT(T2.language) DESC
LIMIT 1
Original question: "What is the annual average number of sets that were released between 1/1/2012 to 12/31/2015? Indicate the common language of the card."
Addressing overly specific questions
We found that many questions in the dataset were too specific, often including details explicitly stated in the query, leading to inflated performance metrics.
Using LLMs to generalize questions
To address this, we used large language models to rewrite questions, creating more generalized versions that captured the intent without leaking specific details.
Rewritten question: "What is the average popularity of sets released from 2012-15 by lang?"
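The rewriting itself was just another LLM call. Here is a hedged sketch of how such a generalization step might look; the prompt wording and the complete helper are illustrative assumptions, not the exact prompt we used:

GENERALIZE_PROMPT = """Rewrite the question so it keeps the analyst's intent but drops
details copied verbatim from the SQL (exact column names, literal filter values,
table names). Return only the rewritten question.

<question>
{question}
</question>

<sql>
{sql}
</sql>
"""

def generalize_question(question, sql, complete):
    # `complete` is any prompt-in, text-out wrapper around your LLM client.
    return complete(GENERALIZE_PROMPT.format(question=question, sql=sql)).strip()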
This process significantly changed our performance metrics (which was desired; we didn't want a test that was too easy):
- Recall@5 for snippets: 0.84 → 0.68 (20 % decrease)
- Recall@5 for summaries: 0.9 → 0.81 (10 % decrease)
- Recall@10 for snippets: 0.92 → 0.82 (11 % decrease)
- Recall@10 for summaries: 0.95 → 0.92 (4 % decrease)
Generating table labels with LLMs
Since BirdSQL didn't include relevancy labels for question and SQL pairs, we used GPT to automatically generate these labels. We provided the model with available tables and the SQL snippet, then asked it to identify which were relevant.
Prompt example:
Given a set of tables and a SQL snippet, describe if the tables or snippets are relevant to the question.
Here are the relevant tables:
<tables>
[insert tables in database here]
</tables>
Here is the snippet:
<snippet>
[insert snippet here]
</snippet>
Here is the question:
<query>
[insert question]
</query>
Determine if the snippet is relevant
This allowed us to produce a dataset against which to test information retrieval metrics. Now, our only job is to iterate on a summary prompt that will help us improve our recall metrics.
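With labeled pairs in place, the evaluation loop is straightforward. A minimal sketch of recall@k over that dataset, assuming each test case carries the IDs of its relevant snippets and retrieve returns ranked snippet IDs:

def recall_at_k(test_cases, retrieve, k=5):
    """Average fraction of a question's relevant snippets that appear in the top-k results."""
    scores = []
    for case in test_cases:
        # case = {"question": "...", "relevant_ids": {"snippet_12", "snippet_47"}}
        retrieved = set(retrieve(case["question"], k))
        relevant = case["relevant_ids"]
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores)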
Iterating on summary generation
We continuously refined our prompts for generating synthetic summaries. Our final prompt instructed the model to do the following:
- Highlight the main objective of the query.
- Mention important data sources (tables) used.
- Describe key SQL operations employed.
- Explain primary metrics or calculations performed.
- Detail significant filters or conditions applied.
- Describe the expected output and its business relevance.
- Suggest 2-3 specific ways the query could be modified or extended.
- Propose potential questions that this query might help answer.
This iterative process led to significant improvements. Our final prompt achieved a recall@5 of 0.91, a 54 % improvement over embedding the code example itself.
Evaluating robustness to noise
To test the robustness of our approach, we deliberately mixed irrelevant data into our test set to see how well the system could handle extra, unrelated information: like finding a needle in a haystack, but with more hay added.
| Method | With 10% irrelevant data | With 50% irrelevant data |
| --- | --- | --- |
| Summaries | 92% accuracy | 90% accuracy |
| Snippets | 85% accuracy | 68% accuracy |
Our summary-based method kept working well, maintaining about 90 % accuracy in finding the right information (recall@5), even when half of the data was irrelevant. It's like our system could still find the needle, even after we added a lot more hay to the stack.
On the other hand, the snippet-based approach struggled more as we added irrelevant data. Its accuracy dropped from 85 % to 68 % when we increased irrelevant data from 10 % to 50 %.
This test showed that our summary method is better at ignoring extra, unrelated information and focusing on what's important. This is crucial for real-world applications, where databases often contain lots of tables and information that aren't always relevant to every query.
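Mechanically, the noise test is the same evaluation run against a padded corpus. A rough sketch, assuming a pool of distractor snippets from unrelated databases and the recall_at_k helper sketched earlier:

import random

def evaluate_with_noise(corpus, distractors, noise_fraction, build_retriever, test_cases, k=5):
    """Mix irrelevant snippets into the corpus so they make up `noise_fraction` of it,
    rebuild the retriever, and re-measure recall@k."""
    n_noise = int(len(corpus) * noise_fraction / (1 - noise_fraction))
    noisy_corpus = corpus + random.sample(distractors, min(n_noise, len(distractors)))
    retrieve = build_retriever(noisy_corpus)  # e.g., re-embed summaries over the padded corpus
    return recall_at_k(test_cases, retrieve, k=k)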
Key takeaways from our testing process
- Start simple : Beginning with simple queries helped set up our initial framework, but complex queries were needed for proper testing.
- Use realistic data : Leveraging existing datasets provided a level of complexity and realism we couldn't achieve manually. If your baselines are suspiciously high, chances are your task is too easy.
- Beware of data leakage : Questions too specific to queries led to artificially high-performance metrics.
- Leverage LLMs creatively : We used language models for generating summaries, rewriting questions, and generating labels.
- Iterate on prompts : Continuously refining our prompts was key to improving system performance.
- Measure objectively : Using metrics like recall@5 and recall@10 gave us concrete ways to measure improvement.
- Test for robustness : Introducing noise helped us evaluate how well our approach would perform in less-than-ideal conditions.
This testing process not only helped us refine our approach but also demonstrated the potential of using synthetic summaries for few-shot learning in text-to-SQL tasks and potentially other natural language processing challenges.
Key Findings and Analysis
Our testing revealed several important insights:
1. Improved retrieval accuracy : Our summary index approach improved recall@5 from 81 % to 90 %, a significant boost in information retrieval performance.
2. Robustness to noise : Summaries maintained a recall@5 of over 90 % even when irrelevant data was introduced. This was in stark contrast to snippets, where recall@5 dropped below 70 % as we introduced more irrelevant snippets.
3. Critical role of prompt engineering : The quality of our summaries, and consequently the performance of our entire system, was highly dependent on the prompts used to generate these summaries.
Improving Recall With Better Prompts
In our experiments, we found that the way we ask our AI model to summarize SQL queries makes a big difference. We tried three different ways of asking, or "prompts," and each one gave us better results. Let's look at what we did and what we learned:
- Simple prompt : We just asked for a quick summary.
Generate a concise, keyword-rich summary of the given SQL query without starting with phrases like "This SQL" or "The query". Avoid adjectives and focus on technical details.
<snippet>
{{ query }}
</snippet>
- Role-based prompt : We told the AI to act like an experienced data analyst and gave it some rules.
<role>
You are an experienced Data Analyst with extensive SQL expertise that's cataloguing a SQL query for future use.
</role>
Generate a concise, keyword-rich summary of the given SQL query without starting with phrases like "This SQL" or "The query". Avoid adjectives and focus on technical details.
Include:
1. Main SQL operations and techniques used
2. Data types and business context analyzed
3. Key calculations and metrics
4. Table names and important columns (be specific and comprehensive)
5. Filtering criteria and time frames
6. Specific SQL functions and clauses
7. Query structure elements
8. Output description and its business value
<snippet>
{{ query }}
</snippet>
- Detailed prompt with examples : We gave the AI a specific role, clear instructions, and examples of good summaries.
<role>
You are an experienced Data Analyst with extensive SQL expertise that's cataloguing a SQL query for future use.
</role>
<task>
Your task is to review the given SQL query and produce a concise, informative summary that covers the query's key aspects. Your summary should be structured in a single paragraph and provide enough detail for other analysts to quickly understand the query's purpose and potential modifications for their use cases.
Focus on highlighting:
1. The main objective of the query
2. Important data sources (tables) used
3. Key SQL operations employed
4. Primary metrics or calculations performed
5. Significant filters or conditions applied
6. The expected output and its business relevance
7. 2-3 specific ways the query could be modified or extended for related analytical needs.
8. Potential questions that other analysts might have which this query might be useful for answering.
Make sure to output valid JSON in your response without any potential trailing characters at the end.
</task>
<rules>
1. Keep your summary focused and concise, avoiding unnecessary elaboration or general statements about SQL techniques.
2. Do not explain basic SQL concepts or the value of the query structure unless directly relevant to its unique features.
3. Avoid listing out all possible values for fields or explaining every single value unless specifically relevant to the query's purpose.
4. When suggesting modifications, be specific and relate them directly to potential business questions the query could answer with small changes.
5. Use technical SQL terminology appropriately, but don't overexplain common operations.
6. Highlight any unusual or particularly clever aspects of the query that might not be immediately obvious.
</rules>
<query>
{{SNIPPET}}
</query>
<examples>
[Examples omitted for brevity]
</examples>
What we got
1. Simple prompt summary:
Calculate Elijah Allen's total average event cost for September and October.
2. Role-based prompt summary:
Aggregate average cost calculation for Elijah Allen's spending on events in September and October. Filters by user and two-month time period using SQL WHERE and BETWEEN clauses. Targets events data table and cost column to compute key metric, average cost, highlighting financial insights into individual's event-related expenditures.
3. Detailed prompt summary:
This SQL query calculates the average expense for member 'Elijah Allen' during September and October. It joins the 'member' and 'expense' tables using 'member_id', filters for 'Elijah Allen' by first and last name, and isolates expenses from September and October using a substring comparison on the 'expense_date' field. The result is the average expense incurred by Elijah Allen during those months. This query can be easily modified to analyze expenses for different periods, members, or member groups by adjusting the WHERE clause conditions. Further extensions could include aggregating expenses by category or comparing individual spending to average departmental expenses to identify outliers.
How Well Each Prompt Worked
We measured how well each prompt worked using recall@k: basically, how often the AI found the right information in the first k results. We also looked at how similar the summaries were to what we wanted. Here's what we found:
| Approach | Recall@5 | Recall@10 | Similarity Score |
| --- | --- | --- | --- |
| Just SQL | 0.69 | 0.82 | N/A |
| Simple Prompt | 0.59 | 0.74 | 0.45 |
| Role-Based Prompt | 0.70 | 0.76 | 0.57 |
| Detailed Prompt | 0.91 | 0.94 | 0.67 |
As you can see, our detailed prompt worked the best, finding the right information 91 % of the time in the top five results.
What We Learned
- Start simple, then add details : Beginning with a basic prompt provided a foundation for improvement.
- Assign the AI a role : Instructing the AI to assume an expert persona enhanced the quality of summaries.
- Provide clear instructions : Clearly outlining our requirements for the summary led to better results.
- Use examples : Demonstrating what constitutes a good summary significantly improved the AI's output.
- Implement evaluations : Surprisingly, longer summaries yielded better recall. Evaluations allowed us to shift from intuition-based to data-driven decision-making.
- Leverage evaluations for optimization : With evaluation metrics in place, we could make informed trade-offs between prompt complexity and performance.
Saving Time and Money
When using these prompts, there are two tricks we can use to save time and money:
- Prompt caching : This means we save the parts of the prompt that don't change, like the instructions and examples, which is now supported in the OpenAI and Anthropic APIs. We only send the new parts (like the SQL query) each time. This can make things faster and cheaper (see the sketch after this list).
- Batching : Instead of sending one query at a time, we can send a batch together. This is especially good when we don't need answers right away. It can save money because sending one big batch often costs 50 % less.
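Here is a sketch of how the summarization calls can be organized to take advantage of both ideas: keep the unchanging instructions and examples as one stable prefix (which is what provider-side prompt caching keys on), and collect the per-snippet requests so they can be submitted together when latency doesn't matter. The client wrapper and constants here are placeholders, not a specific SDK's API:

# Static portion: role, rules, and few-shot examples. Keeping this byte-identical
# across calls is what lets OpenAI/Anthropic prompt caching reuse the processed prefix.
STATIC_SUMMARY_INSTRUCTIONS = """<role>...experienced Data Analyst...</role>
<task>...summary requirements from the detailed prompt above...</task>
<examples>...few-shot examples...</examples>"""

def summarize_snippet(sql, complete):
    # `complete(system, user)` is a placeholder for your LLM client call; only the
    # <snippet> block changes between requests.
    return complete(system=STATIC_SUMMARY_INSTRUCTIONS, user=f"<snippet>\n{sql}\n</snippet>")

def summarize_all(snippets, complete):
    # When summaries aren't needed interactively, these requests are good candidates
    # for a provider's batch endpoint, which is typically discounted.
    return [summarize_snippet(sql, complete) for sql in snippets]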
What This Means for Other Tasks
The way we improved our SQL summaries can help with other retrieval tasks, too:
- Summarizing what's in database tables
- Describing images
- Summarizing long documents
- Writing explanations for computer code
- Augmenting text chunks
As long as we have a test set, we can augment it with synthetic data and experiment with prompting techniques to improve our recall.
Remember, the prompt is super important. It's not just something you set up once and forget. It's a key part of making AI work well for specific tasks. As AI keeps getting better, knowing how to write good prompts, find good few-shot examples, and decompose problems into straightforward metrics will become an even more valuable skill.
If you’re looking for a database to build AI products with, check out pgai and Timescale Cloud. With Timescale Cloud, developers can access pgvector and pgai—extensions that turn PostgreSQL into an easy-to-use and high-performance vector database, plus a fully managed cloud database experience. Build your AI application with Timescale Cloud today (it's free for 30 days, no credit card required).