Introduction
Hello, amazing DEV people! 😉
Today I will show you a wonderful query optimization technique for Postgres that I often use myself. This approach can save you from a long and tedious migration of your project to another technology stack, such as GraphQL.
Intrigued? Here we go! 👇
Problem statement
We'll take query optimization as an example of a simple task for any developer. Let's imagine that we have the task of creating a new endpoint for the REST API of our project, which should return:
- Data on the requested project by its alias;
- Array of all tasks that relate to the requested project in descending order by creation date;
- Number of tasks (as a separate response attribute);
Here you can immediately see one quick solution: make a separate query for each of the models in the database (one for the project and one for its related tasks).
Well, let's look at it in more detail.
☝️ Note: I will give all the code samples in Go with Fiber web framework, since this is my main language & framework for backend development at the moment.
A quick solution to the problem
Okay, here is our controller for the endpoint:
// ./app/controllers/project_controller.go

// ...

// GetProjectByAlias func for getting one project by given alias.
func GetProjectByAlias(c *fiber.Ctx) error {
    // Catch project alias from URL.
    alias := c.Params("alias")

    // Create database connection.
    db, err := database.OpenDBConnection()
    if err != nil {
        return err
    }

    // Get project by alias.
    project, err := db.GetProjectByAlias(alias)
    if err != nil {
        return err
    }

    // Get all tasks by project ID.
    tasks, err := db.GetTasksByProjectID(project.ID)
    if err != nil {
        return err
    }

    // Return status 200 OK.
    return c.JSON(fiber.Map{
        "status":      fiber.StatusOK,
        "project":     project,    // <-- 1
        "tasks_count": len(tasks), // <-- 2
        "tasks":       tasks,      // <-- 3
    })
}
As you can see, this controller fully meets the conditions of our task (all three points of the original problem).
— Will it work?
— Yes, of course!
— But would such code be optimal?
— Probably not… 🤷
We call GetProjectByAlias and then GetTasksByProjectID one after the other, which adds extra latency and wastes resources of both the API server and the PostgreSQL database itself. That's because the underlying queries most likely look like this:
-- For Project model:
SELECT *
FROM
projects
WHERE
alias = $1::varchar
LIMIT 1
-- For Task model:
SELECT *
FROM
tasks
WHERE
project_id = $1::uuid
ORDER BY
created_at DESC
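For context, here is a minimal sketch of what the two query methods behind this controller might look like. I'm assuming the jmoiron/sqlx package, a ProjectQueries wrapper, and a models package here; the article doesn't show its real queries package, so treat all names and the import path as illustrative:

// ./app/queries/project_queries.go (a sketch; names and layout are illustrative)
package queries

import (
    "github.com/google/uuid"
    "github.com/jmoiron/sqlx"

    "your-project/app/models" // placeholder import path for your own models package
)

// ProjectQueries wraps the database handle for project-related queries.
type ProjectQueries struct {
    *sqlx.DB
}

// GetProjectByAlias fetches a single project by its alias (query #1).
func (q *ProjectQueries) GetProjectByAlias(alias string) (models.Project, error) {
    project := models.Project{}
    // One round trip to the database for the project itself.
    err := q.Get(&project, `SELECT * FROM projects WHERE alias = $1::varchar LIMIT 1`, alias)
    return project, err
}

// GetTasksByProjectID fetches all tasks of the given project (query #2).
func (q *ProjectQueries) GetTasksByProjectID(projectID uuid.UUID) ([]models.Task, error) {
    tasks := []models.Task{}
    // A second round trip for the related tasks.
    err := q.Select(&tasks, `SELECT * FROM tasks WHERE project_id = $1::uuid ORDER BY created_at DESC`, projectID)
    return tasks, err
}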
Since Go was created for speed and efficient use of server resources, such a waste of resources is simply unacceptable for any self-respecting Go developer.
Let's fix that in the next section.
Optimize this
So, how do we optimize this? Of course, by reducing the number of queries to the database. But then how do we get all the necessary tasks for the project and their number?
This is where the wonderful built-in aggregate function jsonb_agg comes to our rescue. It appeared back in PostgreSQL 9.5 and is constantly being improved from version to version. Furthermore, we will use the COALESCE function with a FILTER condition to correctly handle the empty value when a project has no tasks, and we will count the number of tasks right away with the COUNT function.
☝️ Note: See the official PostgreSQL documentation for more info about COALESCE.
SELECT
p.*,
COALESCE(jsonb_agg(t.*) FILTER (WHERE t.project_id IS NOT NULL), '[]') AS tasks,
COUNT(t.id) AS tasks_count
FROM
projects AS p
LEFT JOIN tasks AS t ON t.project_id = p.id
WHERE
p.alias = $1::varchar
GROUP BY
p.id
LIMIT 1
It's a little difficult to understand at first, isn't it? Don't worry, you'll figure it out! Here's an explanation of what's going on:
- Output all the data about the found project;
- Select only one project: the one with the unique alias we are looking for;
- Using a LEFT JOIN, attach only those tasks that are linked to the project by its ID;
- Group all the data by project ID;
- Aggregate all the obtained tasks with the jsonb_agg aggregate function, filtering out the rows where the join found no task;
- For projects that have no tasks, return an empty list instead of NULL;
- Use the COUNT function to calculate the number of tasks in the project.
Next, we just need to prepare the output of all the data obtained from the database. Let's add the appropriate structures to the Project and Task models.
A simplified structure with a description of each project task:
// ./app/models/task_model.go

// ...

// GetProjectTasks struct to describe getting tasks list for given project.
type GetProjectTasks struct {
    ID          uuid.UUID `db:"id" json:"id"`
    Alias       string    `db:"alias" json:"alias"`
    Description string    `db:"description" json:"description"`
}
And additional structures for the Project model:
// ./app/models/project_model.go

// ...

// ProjectTasks struct to describe getting list of tasks for a project.
type ProjectTasks []*GetProjectTasks // struct from the Task model

// GetProject struct to describe getting one project.
type GetProject struct {
    ID            uuid.UUID    `db:"id" json:"id"`
    CreatedAt     time.Time    `db:"created_at" json:"created_at"`
    UpdatedAt     time.Time    `db:"updated_at" json:"updated_at"`
    UserID        uuid.UUID    `db:"user_id" json:"user_id"`
    Alias         string       `db:"alias" json:"alias"`
    ProjectStatus int          `db:"project_status" json:"project_status"`
    ProjectAttrs  ProjectAttrs `db:"project_attrs" json:"project_attrs"`

    // Fields for JOIN tables:
    TasksCount int          `db:"tasks_count" json:"tasks_count"`
    Tasks      ProjectTasks `db:"tasks" json:"tasks"`
}
☝️ Note: The ProjectTasks type is needed to correctly output the list of all the tasks in the project.
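One practical detail worth spelling out: Postgres returns the tasks column as a single jsonb value, so the ProjectTasks type must also implement the sql.Scanner interface for the driver to decode it into the slice. Here is a minimal sketch of how that might look (the article doesn't show this code; the error message is mine, and the errors and encoding/json imports are elided like in the other snippets):

// ./app/models/project_model.go

// ...

// Scan implements the sql.Scanner interface, so the jsonb value produced
// by jsonb_agg in the `tasks` column can be decoded into ProjectTasks.
func (pt *ProjectTasks) Scan(value interface{}) error {
    if value == nil {
        *pt = ProjectTasks{} // no value in the tasks column; keep an empty list
        return nil
    }
    b, ok := value.([]byte)
    if !ok {
        return errors.New("models: unexpected type for tasks column")
    }
    return json.Unmarshal(b, pt)
}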
Let's fix the controller:
// ./app/controllers/project_controller.go

// ...

// GetProjectByAlias func for getting one project by given alias.
func GetProjectByAlias(c *fiber.Ctx) error {
    // Catch project alias from URL.
    alias := c.Params("alias")

    // Create database connection.
    db, err := database.OpenDBConnection()
    if err != nil {
        return err
    }

    // Get project by alias, together with its tasks and tasks count.
    project, err := db.GetProjectByAlias(alias)
    if err != nil {
        return err
    }

    // Return status 200 OK.
    return c.JSON(fiber.Map{
        "status":  fiber.StatusOK,
        "project": project, // <-- 1, 2, 3
    })
}
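For completeness, here is a hedged sketch of what the updated db.GetProjectByAlias query method might look like, again assuming the same illustrative sqlx wrapper as before. Note that the ORDER BY inside the aggregate is my addition to keep tasks sorted by creation date, as the original task requires:

// ./app/queries/project_queries.go (a sketch; names are illustrative)

// ...

// GetProjectByAlias fetches one project together with its tasks and
// tasks count in a single round trip to the database.
func (q *ProjectQueries) GetProjectByAlias(alias string) (models.GetProject, error) {
    project := models.GetProject{}
    query := `
        SELECT
            p.id, p.created_at, p.updated_at, p.user_id, p.alias,
            p.project_status, p.project_attrs,
            COUNT(t.id) AS tasks_count,
            COALESCE(
                jsonb_agg(t.* ORDER BY t.created_at DESC)
                FILTER (WHERE t.project_id IS NOT NULL),
                '[]'
            ) AS tasks
        FROM projects AS p
        LEFT JOIN tasks AS t ON t.project_id = p.id
        WHERE p.alias = $1::varchar
        GROUP BY p.id
        LIMIT 1
    `
    err := q.Get(&project, query, alias)
    return project, err
}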
The final optimized query result for our new endpoint should look like this:
{
  "status": 200,
  "project": {
    "id": "a5326b7d-eb6c-4d5e-b264-44ee15fb4375",
    "created_at": "2021-09-21T19:58:30.939495Z",
    "updated_at": "0001-01-01T00:00:00Z",
    "user_id": "9b8734f9-05c8-43ac-9cd8-d8bd15230624",
    "alias": "dvc08xyufws3uwmn",
    "project_status": 1,
    "project_attrs": {
      "title": "Test title",
      "description": "Test description",
      "category": "test"
    },
    "tasks_count": 5,
    "tasks": [
      {
        "id": "26035934-1ea4-42e7-9364-ef47a5b57126",
        "alias": "dc3b9d2b6296",
        "description": "Task one"
      },
      // ...
    ]
  }
}
That's how gracefully and easily we used all the power of built-in Postgres functions and pure SQL to solve a database query optimization problem.
Wow, how great is that? 🤗
Postgres query analyzing
As rightly noted in the comments, this article lacked some analytics on query execution time. Well, I'll fix that by demonstrating a synthetic test with the PostgreSQL EXPLAIN command and its ANALYSE option.
The test will involve three queries:
- Two simple SELECT queries: one for the project and one for all of its tasks (looked up by the project ID, which is covered by an index);
- One complex query with two LEFT JOIN tables that builds an aggregated JSONB object (for ease of output, without resorting to conversions within Go, using only the built-in means of Postgres 13).
☝️ Note: I specifically took a more complex query (rather than the one in the article above) to demonstrate how well and efficiently the Postgres database is able to perform queries.
There are 3 projects in my test table, each with 2 tasks. The database itself runs on an Apple MacBook Pro (early 2015, Intel i5, 8 GB RAM) in a Docker container with the latest stable version of Postgres 13.x (13.4-1.pgdg100+1).
So, simple queries will look like this:
-- First simple query:
SELECT *
FROM
projects
WHERE
id = '6e609cb8-d62d-478b-8691-151d355af59d'
LIMIT 1
-- Second simple query:
SELECT *
FROM
tasks
WHERE
project_id = '6e609cb8-d62d-478b-8691-151d355af59d'
And here's a complex query:
-- Third complex query:
SELECT
p.id,
p.created_at,
p.updated_at,
p.project_status,
p.project_attrs,
jsonb_build_object(
'user_id', u.id,
'first_name', u.user_attrs->'first_name',
'last_name', u.user_attrs->'last_name',
'picture', u.user_attrs->'picture'
) AS author,
COUNT(t.id) AS tasks_count,
COALESCE(
jsonb_agg(
jsonb_build_object(
'id', t.id,
'status', t.task_status,
'name', t.task_attrs->'name',
'description', t.task_attrs->'description',
'steps_count', jsonb_array_length(t.task_attrs->'steps')
)
)
FILTER (WHERE t.project_id IS NOT NULL), '[]'
) AS tasks
FROM
projects AS p
LEFT JOIN users AS u ON u.id = p.user_id
LEFT JOIN tasks AS t ON t.project_id = p.id
WHERE
p.id = '6e609cb8-d62d-478b-8691-151d355af59d'
GROUP BY
p.id,
u.id
LIMIT 1
If you run these queries one by one as EXPLAIN ANALYSE SELECT ..., you can get the following results:
# For the first simple query:
Limit  (cost=0.15..8.17 rows=1 width=84) (actual time=0.263..0.302 rows=1 loops=1)
  ->  Index Scan using projects_pkey on projects  (cost=0.15..8.17 rows=1 width=84) (actual time=0.238..0.248 rows=1 loops=1)
        Index Cond: (id = '6e609cb8-d62d-478b-8691-151d355af59d'::uuid)
Planning Time: 0.177 ms
Execution Time: 0.376 ms

# For the second simple query:
Seq Scan on tasks  (cost=0.00..17.88 rows=3 width=100) (actual time=0.026..0.056 rows=2 loops=1)
  Filter: (project_id = '6e609cb8-d62d-478b-8691-151d355af59d'::uuid)
Planning Time: 0.180 ms
Execution Time: 0.139 ms

# For the third complex query:
Limit  (cost=34.37..34.42 rows=1 width=156) (actual time=0.351..0.479 rows=1 loops=1)
  ->  GroupAggregate  (cost=34.37..34.42 rows=1 width=156) (actual time=0.333..0.437 rows=1 loops=1)
        Group Key: p.id, u.id
        ->  Sort  (cost=34.37..34.37 rows=1 width=184) (actual time=0.230..0.335 rows=2 loops=1)
              Sort Key: u.id
              Sort Method: quicksort  Memory: 27kB
              ->  Nested Loop Left Join  (cost=0.29..34.36 rows=1 width=184) (actual time=0.106..0.260 rows=2 loops=1)
                    Join Filter: (t.project_id = p.id)
                    ->  Nested Loop Left Join  (cost=0.29..16.44 rows=1 width=116) (actual time=0.063..0.128 rows=1 loops=1)
                          ->  Index Scan using projects_pkey on projects p  (cost=0.15..8.17 rows=1 width=84) (actual time=0.021..0.038 rows=1 loops=1)
                                Index Cond: (id = '6e609cb8-d62d-478b-8691-151d355af59d'::uuid)
                          ->  Index Scan using users_pkey on users u  (cost=0.14..8.16 rows=1 width=48) (actual time=0.014..0.022 rows=1 loops=1)
                                Index Cond: (id = p.user_id)
                    ->  Seq Scan on tasks t  (cost=0.00..17.88 rows=3 width=68) (actual time=0.018..0.043 rows=2 loops=1)
                          Filter: (project_id = '6e609cb8-d62d-478b-8691-151d355af59d'::uuid)
Planning Time: 0.226 ms
Execution Time: 0.585 ms
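By the way, if you'd rather capture such plans from Go than from psql, EXPLAIN ANALYSE returns the plan as plain text rows that you can read back like any other result set. Here is a minimal sketch using database/sql with the pgx stdlib driver (the DSN is a placeholder for your own setup):

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func main() {
    // Placeholder DSN: point it at your own database.
    db, err := sql.Open("pgx", "postgres://postgres:postgres@localhost:5432/postgres")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // EXPLAIN ANALYSE actually executes the query and returns the plan
    // as one text row per plan line.
    rows, err := db.Query(`EXPLAIN ANALYSE
        SELECT * FROM tasks WHERE project_id = '6e609cb8-d62d-478b-8691-151d355af59d'`)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var line string
        if err := rows.Scan(&line); err != nil {
            log.Fatal(err)
        }
        fmt.Println(line)
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}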
At first glance, it may seem that the two simple queries are much more effective than the complicated one. But don't be fooled by the low Planning and Execution values, because this test does not account for the network lag of each request or the subsequent processing in your Go program!
As for the complex query, the response object has already been generated “under the hood” by the built-in PostgreSQL functions, and we simply pass it along in the JSON response to the consumer.
Therefore, under a really high load, one complex query (which can also be cached, as I do in my projects) will always win over many simple ones.
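The caching mentioned above doesn't have to be anything fancy. A tiny in-process TTL cache keyed by project alias is often enough; here is a minimal sketch (the package, the TTL, and the interface{} payload are my assumptions, not code from the article):

// ./pkg/cache/cache.go (illustrative sketch, not from the article)
package cache

import (
    "sync"
    "time"
)

// entry holds a cached value and its expiration time.
type entry struct {
    value     interface{}
    expiresAt time.Time
}

// TTLCache is a tiny in-process cache with per-entry expiration.
type TTLCache struct {
    mu    sync.RWMutex
    items map[string]entry
    ttl   time.Duration
}

// New creates a cache whose entries live for the given TTL.
func New(ttl time.Duration) *TTLCache {
    return &TTLCache{items: make(map[string]entry), ttl: ttl}
}

// Get returns the cached value for key, if present and not expired.
func (c *TTLCache) Get(key string) (interface{}, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    e, ok := c.items[key]
    if !ok || time.Now().After(e.expiresAt) {
        return nil, false
    }
    return e.value, true
}

// Set stores a value for key with the configured TTL.
func (c *TTLCache) Set(key string, value interface{}) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.items[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}

In the controller, you would check Get(alias) before querying the database and call Set(alias, project) after a cache miss.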
👌 Note: It's fine if there are only two queries, but I've refactored many projects where one endpoint went to the database more than 10 times with such “simple queries”. This kept the database under a constant load far greater than one prepared complex query would create.
Photos and videos by
- Hannah Busing https://unsplash.com/photos/Zyx1bK9mqmA
P.S.
If you want more articles (like this) on this blog, then post a comment below and subscribe to me. Thanks! 😻
❗️ You can support me on Boosty, both on a permanent and on a one-time basis. All proceeds will go to support my OSS projects and will energize me to create new products and articles for the community.
And of course, you can help me make developers' lives even better! Just connect to one of my projects as a contributor. It's easy!
My main projects that need your help (and stars) 👇
- 🔥 gowebly: A next-generation CLI tool that makes it easy to create amazing web applications with Go on the backend, using htmx, hyperscript or Alpine.js and the most popular CSS frameworks on the frontend.
- ✨ create-go-app: Create a new production-ready project with Go backend, frontend and deploy automation by running one CLI command.
Top comments (5)
Why open a database connection in every route instead of opening one connection in main.go?
Thanks for this comment! 😊 This article only serves as an example to understand the topic. All Go code examples have been kept simple so that any reader can get up to speed quickly. Don't judge strictly, please!
Yes, you are right: it is better to open the database connection once at application startup. Better yet, do it via DI (I use the google/wire package, for example).
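For readers who want to apply this advice, here is a minimal hedged sketch of opening the connection pool once in main and sharing it with a handler via a closure (the DSN, route, and query are placeholders, and sqlx with the pgx driver is my assumption):

// ./main.go (illustrative sketch, not the article's real code)
package main

import (
    "log"

    "github.com/gofiber/fiber/v2"
    _ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
    "github.com/jmoiron/sqlx"
)

func main() {
    // Open the connection pool once at startup instead of in every handler.
    db, err := sqlx.Connect("pgx", "postgres://postgres:postgres@localhost:5432/postgres")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    app := fiber.New()

    // The handler closes over the shared pool instead of opening its own.
    app.Get("/project/:alias", func(c *fiber.Ctx) error {
        var count int
        if err := db.Get(&count, `SELECT COUNT(*) FROM projects WHERE alias = $1`, c.Params("alias")); err != nil {
            return err
        }
        return c.JSON(fiber.Map{"status": fiber.StatusOK, "exists": count > 0})
    })

    log.Fatal(app.Listen(":3000"))
}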
Hi Vik. Does that optimisation come with any metrics? Maybe the COALESCE operator takes more time than the previous two linear queries... maybe, who knows.
Hi, thanks for this comment! I added a new chapter called Postgres query analyzing to the article to better explain the reasoning.
Thanks for your tips. As for PostgreSQL query optimization, I guess this guide can also be useful: devart.com/dbforge/postgresql/stud...