Use-case: we have a set of meters
(e.g. gas, electricity, etc) that regularly record readings
. Our task is to retrieve the latest reading of every meter (or more generically phrased: retrieving the latest row per group
) using SQL.
SQL is a declarative language. Therefore, we achieve goal using it by expressing only what we want in the form of queries, instead of providing precise instructions about how to retrieve data. Then the database component called the planner will determine what's the best way to do that based on several factors such as table statistics. However, as we shall see, going for the first approach that comes to mind when writing queries might not yield the best performance. So ideally, we know as much as possible what the database does internally to retrieve our data, and what are our options to influence this towards the optimal access path for our use-cases.
In the sections below, to explain exactly what makes the alternatives I show you faster, I use as support visualisations of Postgres explain plans, showing how the database processes our queries internally. As you will see, some adjustments will make quite a difference - we will go from hundreds
of ms to about 5
. I sometimes thought of side notes to the main story to share with you that I've marked appropriately.
I'm using Postgres
version 16.2
(with its default configuration) running on my laptop .
Tables
First, we create the table of meters
.
create table meters
(
id bigint primary key generated always as identity,
);
Then, we create the table of readings
.
create table readings
(
id bigint primary key generated always as identity,
meter_id bigint,
date date,
reading double precision,
constraint fk__readings_meters foreign key (meter_id) references meters (id)
);
For enforcing integrity, the tables are linked together with a foreign key constraint. I am also not using serial when setting up the primary keys.
Generating test data
Let's now populate our tables with some rows: we add 500
meters, all having one reading every day for a year. For generating test data for situations like this, the generate_series function is invaluable.
insert into meters select * from generate_series(1, 500) seq;
insert into readings(meter_id, date, reading)
select m.id, seq, random() from generate_series('2024-02-01'::date, '2025-02-01'::date, '1 day'::interval) seq, meters m;
Querying
Our tables are now populated and ready to be queried. Let's get to it!
1. Window functions: 250ms
We first express that we want to partition our readings per meter IDs, and then use the row_number
function to assign an "order number" for every reading based on its age among its peers within an individual partition. You then use this to filter the result set and return only ones that are at the top (have row number equal to 1) per every partition.
explain(analyse, buffers)
with readings_with_rownums as (
select
meters.id as meter_id,
readings.reading as reading,
row_number() over (partition by meters.id order by readings.date desc) as rownum
from
readings
join
meters on meters.id = readings.meter_id
)
select
meter_id,
reading
from
readings_with_rownums
where
readings_with_rownums.rownum = 1;
This runs in about 250ms
. Actually, not that bad of a starting point. However, this timing does mean that the users will not perceive the response as being instantaneous (~100ms). Let's try to do better!
But how? Time to have a first look at the explain plan and try to get some clues.
It's more pleasant to look at explain plans using visualisation tools instead of in text form (the way we get them from Postgres itself by default). You have several great alternatives, here's some examples:
For the rest of this blog I'm going to use the one from Dalibo. Here's our first one:
A word on how to interpret them. The "flow" of data is from the bottom side upwards, as the arrow in the lower left part indicates. With other words, it should be read from bottom to top. I've also opened the relevant explain plan nodes which I want to explore further, and highlighted the row counts, which are the first thing that jumped out to me.
Looking at the row counts we can conclude the query efficiency is low. It reads, and then discards 99%
of the rows before returning the final results. You can see that in the Sort
node, where 183500
rows arrive as input from the nodes below, but then, only 500
are actually returned in the final result set. This way to get our results has low scalability, being very sensitive to increases in the dataset.
Side note: Let's also consider the overall resource utilisation for a moment too. When the database is performing sorts like this, especially over and over again (like in an application used by several users concurrently) on potentially very large tables, you will most probably see a spike in the CPU utilisation, like I did one time. If possible, we should consider limiting CPU usage for situations where there are actually no alternatives. One of the reasons to do this could be that some cloud provider databases don't let you get more CPU without a proportional increase in RAM or other resources.
Alright, time to move on. Let's proceed and look at other options.
2. DISTINCT ON: 250ms
I found this approach rather elegant actually, and I was rooting for it to perform better. To understand this works, you can have a look at the DISTINCT clause section in the Postgres docs. Here's how it looks like when we apply it to retrieve our readings:
explain (analyse, buffers)
select
distinct on (meters.id)
meters.id as meter_id,
readings.reading as reading
from
meters
join
readings on meters.id = readings.meter_id
order by
meters.id,
readings.date desc;
I didn't get a noticeable difference with regards to how fast it runs compared with the previous approach with the window functions. The explain plan looks quite similar to the one we have seen before:
We can observe a difference though. It doesn't contain the WindowAgg
node anymore you've seen before, in general it is indeed good to have less operations, but this didn't get us very far in our pursuit of reducing the query time. It's is still inefficient, reading a lot of rows and discarding the majority before returning the final results.
One thing I noticed in the explain plan is the following detail which is definitely worth taking a closer look at when you see it in your explain plans:
Sort Method: external merge Disk: 3968kB
This means that the sort operation is being slowed down because it is forced to use the disk. This happens when the work_mem
setting is too low given the size of the dataset. The sort can't be done fully in memory, so it has no choice but to "spill" the operation to disk. The default setting for work_mem
in Postgres is 4MB
, which in this case proves to be too little.
Let's increase this setting to make sure it's sufficient to do the sort in memory. But note that if you give it too much, it can be problematic because if the load increases a lot surprisingly, you can run out of memory - it's a balancing act.
set work_mem='16MB';
I don't have to restart Postgres for this to take effect. For other settings we have to.
We retry the query above, and I we can confirm that the sort does happen in memory now; the proof is that we can then the following in the explain plan:
Sort Method: quicksort Memory: 14951kB
Did this make a difference though? Not really - we're still at ~250ms
response time. Ah well, no problem, we still got other options.
Side note: This experiment is on my laptop, and the CPU / memory / storage are all on one machine. But for example when using Amazon RDS, the storage relies on another service - EBS, which is separated by a network to the database. So in those scenarios, avoiding disk "spills" like the one I showed you above will make more of a difference because the data has to "travel" more between memory and disk. You might also like to know that recently AWS introduced the Optimised Reads option, where for these operations, the database instance can use a fast local NVMe SSD disk instead of the network attached EBS volume, so the disk spills are less impactful.
3. MAX(DATE): 115ms
Time to switch gears again. What do you think of this approach, this time using SQL aggregate functions:
explain(analyse, buffers)
with latest_reads_per_meter as (
select
readings.meter_id as meter_id,
max(date) as reading_date
from
readings
group by
readings.meter_id
)
select
readings.meter_id,
readings.reading
from
meters
join
readings on meters.id = readings.meter_id
join
latest_reads_per_meter lrpm on lrpm.meter_id = meters.id
and readings.date = lrpm.reading_date;
As usual, the explain plan:
This looks a bit different than what we have seen before. In a good way! Let's understand why. The key is that we're now doing the sort much earlier
, and so consequently we're discarding the irrelevant rows earlier. It doesn't "carry them over" all the way throughout the execution process. This means less memory consumed because the intermediate results are smaller. However, does it speed up our query?
Indeed it does! Have a look at this.
Execution Time: 114.942 ms
Finally, some solid progress! We cut the runtime we started with in half. But can we do better? You bet, even reduce it by one order of magnitude.
4. Loose index scans: 13ms
Let's have a look at how the loose index scan works, since by now you're probably already wondering, when are you going to "bring in" the indexes?!
When creating indexes, the order of columns matters. The columns have to be defined exactly in this order I'm about to show you, with the "grouping" element first and then the other column which will be used for determining "latest" within a group.
create index idx on readings(meter_id, date);
We can then write the query like this.
explain (analyse, buffers)
select
meters.id as meter_id,
readings.reading as reading
from
meters
cross join lateral (
select
meter_id,
reading
from
readings
where
readings.meter_id = meters.id
order by
date desc
limit 1
) readings;
Execution Time: 13.814 ms
Alright, now we're talking! This is much faster, but let's find out why. For starters - you don't see the 183500
rows anywhere at all in the explain plan! We are also not sorting anything anymore, because the index keeps our data sorted already.
But, let's push the envelope to see how far we can take this. Let's open the IO & Buffers
tab of the index scan node in the above diagram and have a look in there. You don’t get buffers information by default, so make sure to use the buffers option. Here it is:
Let's take a step back and consider how the index-based retrieval works at a high level. There are essentially two steps. What happens is that as a first step, the B-tree index is traversed to determine the rows that match the query predicate, however, after this step, Postgres has to now actually go ahead and retrieve the rows from the table (or heap).
As we can see from the explain plan, to perform the two steps described above it amounts to 2000
blocks read, so about 16MiB
(2000 * 8 kibibytes). A block (or page) is the fundamental unit by which Postgres stores data, have a look here for details. I should also add that when you see Nested loop
nodes in the explain plans, you have to be careful to not mistakenly conclude that the buffer count displayed is of distinct buffers - if Postgres reads the same block several times it will simply add these up towards the total amount and not differentiate.
Let's try to put the 2000
blocks in context and try to interpret it a bit. For example we can ask ourselves: how many blocks does the table readings
have in total?
SELECT relpages FROM pg_class WHERE relname = 'readings';
relpages
----------
1350
Answer: 1350. So it seems to be reading more blocks with this index-based approach than are in the actual table! If we'd just do a sequential scan and simply read the entire table, sort, and discard, like you’ve seen in the approaches before this one, we'd read only 1350 blocks. What you're witnessing is a trade-off to watch out for and factor in your design. Using an index does lead to more speed (no sorts), but actually adds up to more I/O (more blocks touched) operations.
Side note: it is good to know that for example, if you're on Amazon Aurora or other cloud databases, you pay for every block that it has to retrieve from storage because it couldn't find it in memory. A nice cautionary tale about keeping the I/O under control on Aurora would be this one. However, this situation in my generated dataset is a byproduct of how "narrow" my tables are (small number of columns) - it's an artificial setup after all. If there would be more columns, then the number of pages in the table would be much larger, so the B-tree traversing would add up to comparatively much less blocks than there are in the actual table.
Let's see if we can reduce the number of blocks. You might wonder, can we avoid the extra step (reading from the table after reading from the index)? Exactly!
5. Loose index-only scans: 5ms
We can implement a so-called index-only scan
. To do this, we use the include
option when creating the index, like so:
create index idx_with_including on readings(meter_id, date) include (reading);
Let's now retry the query and look at the relevant tabs again.
Two important things to observe here. First, notice the Heap Fetches: 0
, which indicates that it does not go to the heap to get the rows because they are in the index already. Secondly, the total number of blocks is now 1501
, so 500 less than before. Another confirmation that it indeed doesn't read anything from the heap.
It can happen that you try this out, and don't see the Heap Fetches: 0
in your setup, and you might wonder why. This can happen when the visibility map is not up to date, and Postgres has no other option but to go to the heap to get the visibility information about a row. As the visibility map is kept up to date by the autovacuum
process, it is important to regularly visit the configuration settings to make sure it is able to keep up.
Let's look at the timing. How fast did we get it?
Execution Time: 5.448 ms
Great stuff! Let's now understand it a bit better how exactly the B-tree index is traversed for retrieving the data. In Postgres, we have three types of nodes in a B-tree: the root node, intermediate nodes (used only for traversal) and leaf nodes (these contain what we’re interested in: pointers to tuples on the heap or included data). Here's a diagram showing what happens, where each level contains the type of nodes I just mentioned.
For each row returned in the final result, it will do a number of 3 index page read operations - first for the root page, then for an intermediate B-tree page, and lastly it will read the leaf page, from where it will collect the included reading. In the diagram above, I have marked these steps with 1
, 2
, 3
, which are repeated 500 times.
Side note: including columns in the index does increase its overall size, which consequently increases the time needed to traverse it. In some cases, this might make the difference, and it will influence the overall retrieval timing in such a way that it is not beneficial anymore.
Conclusion
Quite the journey we've been on! After seeing all these querying techniques, we've finally arrived at our destination - you've seen a loose index-only scan in action, which gave us our results in the shortest time. Note though, that as the saying goes, there is no free lunch: every index gives the database more work to do at every insert, so you will have to decide on a case by case basis if it's worth it, using measurements. Also, the speedup does depend on the data distribution. In the scenario described above, we have many readings for every meter. Your mileage may vary. But it’s always worth a try!
References
- Stack Overflow.
- Loose Index Scan aka Skip Scan in PostgreSQL
- Select the Most Recent Record (of Many Items) With PostgreSQL
- PERFORMANCE TUNING: MAX AND GROUP BY
- Loose index scan Postgres wiki
- RE: Index Skip Scan
Thanks for reading!
Top comments (0)