Introduction
Hello, amazing DEV people! 😉
Today I will show you a wonderful query optimization technique for Postgres that I often use myself. This approach can save you from a long and tedious migration of your project to another technology stack, such as GraphQL.
Intrigued? Here we go! 👇
Problem statement
We'll take query optimization as an example of a simple task for any developer. Let's imagine that we have the task of creating a new endpoint for the REST API of our project, which should return:
- Data on the requested project by its alias;
- Array of all tasks that relate to the requested project in descending order by creation date;
- Number of tasks (as a separate response attribute);
Here you can immediately see one quick solution: make a separate query for each of the models in the database (one for the project and one for its related tasks).
Well, let's look at it in more detail.
☝️ Note: I will give all the code samples in Go with Fiber web framework, since this is my main language & framework for backend development at the moment.
A quick solution to the problem
Okay, here is our controller for the endpoint:
// ./app/controllers/project_controller.go

// ...

// GetProjectByAlias func for getting one project by given alias.
func GetProjectByAlias(c *fiber.Ctx) error {
    // Catch project alias from URL.
    alias := c.Params("alias")

    // Create database connection.
    db, err := database.OpenDBConnection()
    if err != nil {
        return err
    }

    // Get project by alias.
    project, err := db.GetProjectByAlias(alias)
    if err != nil {
        return err
    }

    // Get all tasks by project ID.
    tasks, err := db.GetTasksByProjectID(project.ID)
    if err != nil {
        return err
    }

    // Return status 200 OK.
    return c.JSON(fiber.Map{
        "status":      fiber.StatusOK,
        "project":     project,    // <-- 1
        "tasks_count": len(tasks), // <-- 2
        "tasks":       tasks,      // <-- 3
    })
}
As you can see, this controller fully meets the conditions of our task (all three points of the original problem).
— Will it work?
— Yes, of course!
— But would such code be optimal?
— Probably not… 🤷
We call GetProjectByAlias and then GetTasksByProjectID one after the other, which adds extra latency and wastes resources of both the API server and the PostgreSQL database itself. That's because the underlying queries most likely look like this:
-- For Project model:
SELECT *
FROM
projects
WHERE
alias = $1::varchar
LIMIT 1
-- For Task model:
SELECT *
FROM
tasks
WHERE
project_id = $1::uuid
ORDER BY
created_at DESC
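For context, here is a minimal sketch of what the two query methods behind this controller might look like. I'm assuming the jmoiron/sqlx package, a ProjectQueries wrapper, and a models package here; the article doesn't show its real queries package, so treat all names and the import path as illustrative:

// ./app/queries/project_queries.go (a sketch; names and layout are illustrative)
package queries

import (
    "github.com/google/uuid"
    "github.com/jmoiron/sqlx"

    "your-project/app/models" // placeholder import path for your own models package
)

// ProjectQueries wraps the database handle for project-related queries.
type ProjectQueries struct {
    *sqlx.DB
}

// GetProjectByAlias fetches a single project by its alias (query #1).
func (q *ProjectQueries) GetProjectByAlias(alias string) (models.Project, error) {
    project := models.Project{}
    // One round trip to the database for the project itself.
    err := q.Get(&project, `SELECT * FROM projects WHERE alias = $1::varchar LIMIT 1`, alias)
    return project, err
}

// GetTasksByProjectID fetches all tasks of the given project (query #2).
func (q *ProjectQueries) GetTasksByProjectID(projectID uuid.UUID) ([]models.Task, error) {
    tasks := []models.Task{}
    // A second round trip for the related tasks.
    err := q.Select(&tasks, `SELECT * FROM tasks WHERE project_id = $1::uuid ORDER BY created_at DESC`, projectID)
    return tasks, err
}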
Since Go was created for speed and efficient use of server resources, such a waste of resources is simply unacceptable for any self-respecting Go developer.
Let's fix that in the next section.
Optimize this
So, how do we optimize this? Of course, by reducing the number of queries to the database. But then how do we get all the necessary tasks for the project and their number?
This is where the wonderful built-in aggregate function jsonb_agg comes to our rescue. It appeared back in PostgreSQL 9.5 and is constantly being improved from version to version. Furthermore, we will use the COALESCE function with a FILTER condition to correctly handle the empty value when a project has no tasks, and we will count the number of tasks right away with the COUNT function.
☝️ Note: See the official PostgreSQL documentation for more info about COALESCE.
SELECT
p.*,
COALESCE(jsonb_agg(t.*) FILTER (WHERE t.project_id IS NOT NULL), '[]') AS tasks,
COUNT(t.id) AS tasks_count
FROM
projects AS p
LEFT JOIN tasks AS t ON t.project_id = p.id
WHERE
p.alias = $1::varchar
GROUP BY
p.id
LIMIT 1
It's a little difficult to understand at first, isn't it? Don't worry, you'll figure it out! Here's an explanation of what's going on:
- Output all the data about the found project;
- Select only one project: the one with the unique alias we are looking for;
- Using a LEFT JOIN, attach only those tasks that are linked to the project by its ID;
- Group all the data by project ID;
- Aggregate all the obtained tasks with the jsonb_agg aggregate function, filtering out the rows where the join found no task;
- For projects that have no tasks, return an empty list instead of NULL;
- Use the COUNT function to calculate the number of tasks in the project.
Next, we just need to prepare the output of all the data obtained from the database. Let's add the appropriate structures to the Project and Task models.
A simplified structure with a description of each project task:
// ./app/models/task_model.go

// ...

// GetProjectTasks struct to describe getting tasks list for given project.
type GetProjectTasks struct {
    ID          uuid.UUID `db:"id" json:"id"`
    Alias       string    `db:"alias" json:"alias"`
    Description string    `db:"description" json:"description"`
}
And additional structures for the Project model:
// ./app/models/project_model.go

// ...

// ProjectTasks struct to describe getting list of tasks for a project.
type ProjectTasks []*GetProjectTasks // struct from the Task model

// GetProject struct to describe getting one project.
type GetProject struct {
    ID            uuid.UUID    `db:"id" json:"id"`
    CreatedAt     time.Time    `db:"created_at" json:"created_at"`
    UpdatedAt     time.Time    `db:"updated_at" json:"updated_at"`
    UserID        uuid.UUID    `db:"user_id" json:"user_id"`
    Alias         string       `db:"alias" json:"alias"`
    ProjectStatus int          `db:"project_status" json:"project_status"`
    ProjectAttrs  ProjectAttrs `db:"project_attrs" json:"project_attrs"`

    // Fields for JOIN tables:
    TasksCount int          `db:"tasks_count" json:"tasks_count"`
    Tasks      ProjectTasks `db:"tasks" json:"tasks"`
}
☝️ Note: The ProjectTasks type is needed to correctly output the list of all the tasks in the project.
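One practical detail worth spelling out: Postgres returns the tasks column as a single jsonb value, so the ProjectTasks type must also implement the sql.Scanner interface for the driver to decode it into the slice. Here is a minimal sketch of how that might look (the article doesn't show this code; the error message is mine, and the errors and encoding/json imports are elided like in the other snippets):

// ./app/models/project_model.go

// ...

// Scan implements the sql.Scanner interface, so the jsonb value produced
// by jsonb_agg in the `tasks` column can be decoded into ProjectTasks.
func (pt *ProjectTasks) Scan(value interface{}) error {
    if value == nil {
        *pt = ProjectTasks{} // no value in the tasks column; keep an empty list
        return nil
    }
    b, ok := value.([]byte)
    if !ok {
        return errors.New("models: unexpected type for tasks column")
    }
    return json.Unmarshal(b, pt)
}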
Let's fix the controller:
// ./app/controllers/project_controller.go

// ...

// GetProjectByAlias func for getting one project by given alias.
func GetProjectByAlias(c *fiber.Ctx) error {
    // Catch project alias from URL.
    alias := c.Params("alias")

    // Create database connection.
    db, err := database.OpenDBConnection()
    if err != nil {
        return err
    }

    // Get project by alias, together with its tasks and tasks count.
    project, err := db.GetProjectByAlias(alias)
    if err != nil {
        return err
    }

    // Return status 200 OK.
    return c.JSON(fiber.Map{
        "status":  fiber.StatusOK,
        "project": project, // <-- 1, 2, 3
    })
}
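For completeness, here is a hedged sketch of what the updated db.GetProjectByAlias query method might look like, again assuming the same illustrative sqlx wrapper as before. Note that the ORDER BY inside the aggregate is my addition to keep tasks sorted by creation date, as the original task requires:

// ./app/queries/project_queries.go (a sketch; names are illustrative)

// ...

// GetProjectByAlias fetches one project together with its tasks and
// tasks count in a single round trip to the database.
func (q *ProjectQueries) GetProjectByAlias(alias string) (models.GetProject, error) {
    project := models.GetProject{}
    query := `
        SELECT
            p.id, p.created_at, p.updated_at, p.user_id, p.alias,
            p.project_status, p.project_attrs,
            COUNT(t.id) AS tasks_count,
            COALESCE(
                jsonb_agg(t.* ORDER BY t.created_at DESC)
                FILTER (WHERE t.project_id IS NOT NULL),
                '[]'
            ) AS tasks
        FROM projects AS p
        LEFT JOIN tasks AS t ON t.project_id = p.id
        WHERE p.alias = $1::varchar
        GROUP BY p.id
        LIMIT 1
    `
    err := q.Get(&project, query, alias)
    return project, err
}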
The final optimized query result for our new endpoint should look like this:
{
  "status": 200,
  "project": {
    "id": "a5326b7d-eb6c-4d5e-b264-44ee15fb4375",
    "created_at": "2021-09-21T19:58:30.939495Z",
    "updated_at": "0001-01-01T00:00:00Z",
    "user_id": "9b8734f9-05c8-43ac-9cd8-d8bd15230624",
    "alias": "dvc08xyufws3uwmn",
    "project_status": 1,
    "project_attrs": {
      "title": "Test title",
      "description": "Test description",
      "category": "test"
    },
    "tasks_count": 5,
    "tasks": [
      {
        "id": "26035934-1ea4-42e7-9364-ef47a5b57126",
        "alias": "dc3b9d2b6296",
        "description": "Task one"
      },
      // ...
    ]
  }
}
That's how gracefully and easily we used all the power of built-in Postgres functions and pure SQL to solve a database query optimization problem.
Wow, how great is that? 🤗
Postgres query analyzing
As rightly noted in the comments, this article lacked some analytics on query execution time. Well, I'll fix that by demonstrating a synthetic test with the PostgreSQL EXPLAIN command and its ANALYSE option.
The test will involve three queries:
- Two simple SELECT queries: one for the project and one for all of its tasks (looked up by the project ID, which is covered by an index);
- One complex query with two LEFT JOIN tables that builds an aggregated JSONB object (for ease of output, without resorting to conversions within Go, using only the built-in means of Postgres 13).
☝️ Note: I specifically took a more complex query (rather than the one in the article above) to demonstrate how well and efficiently the Postgres database is able to perform queries.
There are 3 projects in my test table, each with 2 tasks. The database itself runs on an Apple MacBook Pro (early 2015, Intel i5, 8 GB RAM) in a Docker container with the latest stable version of Postgres 13.x (13.4-1.pgdg100+1).
So, simple queries will look like this:
-- First simple query:
SELECT *
FROM
projects
WHERE
id = '6e609cb8-d62d-478b-8691-151d355af59d'
LIMIT 1
-- Second simple query:
SELECT *
FROM
tasks
WHERE
project_id = '6e609cb8-d62d-478b-8691-151d355af59d'
And here's a complex query:
-- Third complex query:
SELECT
p.id,
p.created_at,
p.updated_at,
p.project_status,
p.project_attrs,
jsonb_build_object(
'user_id', u.id,
'first_name', u.user_attrs->'first_name',
'last_name', u.user_attrs->'last_name',
'picture', u.user_attrs->'picture'
) AS author,
COUNT(t.id) AS tasks_count,
COALESCE(
jsonb_agg(
jsonb_build_object(
'id', t.id,
'status', t.task_status,
'name', t.task_attrs->'name',
'description', t.task_attrs->'description',
'steps_count', jsonb_array_length(t.task_attrs->'steps')
)
)
FILTER (WHERE t.project_id IS NOT NULL), '[]'
) AS tasks
FROM
projects AS p
LEFT JOIN users AS u ON u.id = p.user_id
LEFT JOIN tasks AS t ON t.project_id = p.id
WHERE
p.id = '6e609cb8-d62d-478b-8691-151d355af59d'
GROUP BY
p.id,
u.id
LIMIT 1
If you run these queries one by one as EXPLAIN ANALYSE SELECT ..., you can get the following results:
# For the first simple query:
Limit  (cost=0.15..8.17 rows=1 width=84) (actual time=0.263..0.302 rows=1 loops=1)
  ->  Index Scan using projects_pkey on projects  (cost=0.15..8.17 rows=1 width=84) (actual time=0.238..0.248 rows=1 loops=1)
        Index Cond: (id = '6e609cb8-d62d-478b-8691-151d355af59d'::uuid)
Planning Time: 0.177 ms
Execution Time: 0.376 ms

# For the second simple query:
Seq Scan on tasks  (cost=0.00..17.88 rows=3 width=100) (actual time=0.026..0.056 rows=2 loops=1)
  Filter: (project_id = '6e609cb8-d62d-478b-8691-151d355af59d'::uuid)
Planning Time: 0.180 ms
Execution Time: 0.139 ms

# For the third complex query:
Limit  (cost=34.37..34.42 rows=1 width=156) (actual time=0.351..0.479 rows=1 loops=1)
  ->  GroupAggregate  (cost=34.37..34.42 rows=1 width=156) (actual time=0.333..0.437 rows=1 loops=1)
        Group Key: p.id, u.id
        ->  Sort  (cost=34.37..34.37 rows=1 width=184) (actual time=0.230..0.335 rows=2 loops=1)
              Sort Key: u.id
              Sort Method: quicksort  Memory: 27kB
              ->  Nested Loop Left Join  (cost=0.29..34.36 rows=1 width=184) (actual time=0.106..0.260 rows=2 loops=1)
                    Join Filter: (t.project_id = p.id)
                    ->  Nested Loop Left Join  (cost=0.29..16.44 rows=1 width=116) (actual time=0.063..0.128 rows=1 loops=1)
                          ->  Index Scan using projects_pkey on projects p  (cost=0.15..8.17 rows=1 width=84) (actual time=0.021..0.038 rows=1 loops=1)
                                Index Cond: (id = '6e609cb8-d62d-478b-8691-151d355af59d'::uuid)
                          ->  Index Scan using users_pkey on users u  (cost=0.14..8.16 rows=1 width=48) (actual time=0.014..0.022 rows=1 loops=1)
                                Index Cond: (id = p.user_id)
                    ->  Seq Scan on tasks t  (cost=0.00..17.88 rows=3 width=68) (actual time=0.018..0.043 rows=2 loops=1)
                          Filter: (project_id = '6e609cb8-d62d-478b-8691-151d355af59d'::uuid)
Planning Time: 0.226 ms
Execution Time: 0.585 ms
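By the way, if you'd rather capture such plans from Go than from psql, EXPLAIN ANALYSE returns the plan as plain text rows that you can read back like any other result set. Here is a minimal sketch using database/sql with the pgx stdlib driver (the DSN is a placeholder for your own setup):

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func main() {
    // Placeholder DSN: point it at your own database.
    db, err := sql.Open("pgx", "postgres://postgres:postgres@localhost:5432/postgres")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // EXPLAIN ANALYSE actually executes the query and returns the plan
    // as one text row per plan line.
    rows, err := db.Query(`EXPLAIN ANALYSE
        SELECT * FROM tasks WHERE project_id = '6e609cb8-d62d-478b-8691-151d355af59d'`)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var line string
        if err := rows.Scan(&line); err != nil {
            log.Fatal(err)
        }
        fmt.Println(line)
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}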
At first glance, it may seem that the two simple queries are much more effective than the complicated one. But don't be fooled by the low Planning and Execution values, because this test does not account for the network lag of each request or the subsequent processing in your Go program!
As for the complex query, the response object has already been generated “under the hood” by the built-in PostgreSQL functions, and we simply pass it along in the JSON response to the consumer.
Therefore, under a really high load, one complex query (which can also be cached, as I do in my projects) will always win over many simple ones.
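The caching mentioned above doesn't have to be anything fancy. A tiny in-process TTL cache keyed by project alias is often enough; here is a minimal sketch (the package, the TTL, and the interface{} payload are my assumptions, not code from the article):

// ./pkg/cache/cache.go (illustrative sketch, not from the article)
package cache

import (
    "sync"
    "time"
)

// entry holds a cached value and its expiration time.
type entry struct {
    value     interface{}
    expiresAt time.Time
}

// TTLCache is a tiny in-process cache with per-entry expiration.
type TTLCache struct {
    mu    sync.RWMutex
    items map[string]entry
    ttl   time.Duration
}

// New creates a cache whose entries live for the given TTL.
func New(ttl time.Duration) *TTLCache {
    return &TTLCache{items: make(map[string]entry), ttl: ttl}
}

// Get returns the cached value for key, if present and not expired.
func (c *TTLCache) Get(key string) (interface{}, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    e, ok := c.items[key]
    if !ok || time.Now().After(e.expiresAt) {
        return nil, false
    }
    return e.value, true
}

// Set stores a value for key with the configured TTL.
func (c *TTLCache) Set(key string, value interface{}) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.items[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}

In the controller, you would check Get(alias) before querying the database and call Set(alias, project) after a cache miss.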
👌 Note: It's fine if there are only two queries, but I've refactored many projects where one endpoint went to the database more than 10 times with such “simple queries”. This kept the database under a constant load far greater than one prepared complex query would create.
Photos and videos by
- Hannah Busing https://unsplash.com/photos/Zyx1bK9mqmA
P.S.
If you want more articles (like this) on this blog, then post a comment below and subscribe to me. Thanks! 😻
❗️ You can support me on Boosty, both on a permanent and on a one-time basis. All proceeds will go to support my OSS projects and will energize me to create new products and articles for the community.
And of course, you can help me make developers' lives even better! Just connect to one of my projects as a contributor. It's easy!
My main projects that need your help (and stars) 👇
- 🔥 gowebly: A next-generation CLI tool that makes it easy to create amazing web applications with Go on the backend, using htmx, hyperscript or Alpine.js and the most popular CSS frameworks on the frontend.
- ✨ create-go-app: Create a new production-ready project with Go backend, frontend and deploy automation by running one CLI command.
Top comments (5)
Why open a database connection in every route instead of opening one connection in main.go?
Thanks for this comment! 😊 This article only serves as an example to understand the topic. All Go code examples have been kept simple so that any reader can get up to speed quickly. Don't judge strictly, please!
Yes, you are right: it is better to open the database connection once at application startup. Better yet, do it via DI (I use the google/wire package, for example).
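For readers who want to apply this advice, here is a minimal hedged sketch of opening the connection pool once in main and sharing it with a handler via a closure (the DSN, route, and query are placeholders, and sqlx with the pgx driver is my assumption):

// ./main.go (illustrative sketch, not the article's real code)
package main

import (
    "log"

    "github.com/gofiber/fiber/v2"
    _ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
    "github.com/jmoiron/sqlx"
)

func main() {
    // Open the connection pool once at startup instead of in every handler.
    db, err := sqlx.Connect("pgx", "postgres://postgres:postgres@localhost:5432/postgres")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    app := fiber.New()

    // The handler closes over the shared pool instead of opening its own.
    app.Get("/project/:alias", func(c *fiber.Ctx) error {
        var count int
        if err := db.Get(&count, `SELECT COUNT(*) FROM projects WHERE alias = $1`, c.Params("alias")); err != nil {
            return err
        }
        return c.JSON(fiber.Map{"status": fiber.StatusOK, "exists": count > 0})
    })

    log.Fatal(app.Listen(":3000"))
}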
Hi Vik. Does that optimisation come with any metrics? Maybe the COALESCE operator takes more time than the previous two linear queries... maybe, who knows.
Hi, thanks for this comment! I added a new chapter called Postgres query analyzing to the article to better explain the reasoning.
Thanks for your tips. As for PostgreSQL query optimization, I guess this guide can also be useful: devart.com/dbforge/postgresql/stud...