TL;DR: if you want predictable results with LIMIT or FETCH FIRST in any database, you need an ORDER BY.
Here is a quick blog post explaining the following observation: running the same query returns different results even though the table is not changing.
I've set the Repeatable Read isolation level and locked the table in exclusive mode to guarantee no concurrent modifications.
Here is how you can reproduce it.
I created a one-million-row table in PostgreSQL:
postgres=# create table demo as select generate_series(1,1000000) n;
SELECT 1000000
postgres=# vacuum analyze demo;
VACUUM
It's not best practice to create a table without a primary key, but I don't need one here as I'll only run sequential scans.
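You can verify that such a query reads the table with a sequential scan (the cost and row estimates below come from default planner settings and may differ slightly on your system):

postgres=# explain select * from demo limit 3;
                              QUERY PLAN
---------------------------------------------------------------------
 Limit  (cost=0.00..0.04 rows=3 width=4)
   ->  Seq Scan on demo  (cost=0.00..14425.00 rows=1000000 width=4)
(2 rows)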
The trick is that I've run a sequential scan in the background, in a loop for five seconds:
\! s=$SECONDS ; while [ $(( $SECONDS - $s )) -lt 5 ] ; do psql -c 'select * from demo' > /dev/null ; done & sleep 1
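For readability, here is the same one-liner broken out with comments (the logic is identical; psql's \! simply passes the command to the shell):

s=$SECONDS                                   # remember the start time
while [ $(( $SECONDS - $s )) -lt 5 ]         # loop for five seconds
do
  psql -c 'select * from demo' > /dev/null   # one full sequential scan, output discarded
done &                                       # run the loop in the background
sleep 1                                      # give the background scans a head start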
I run my select * from demo limit 3; within those five seconds:
postgres=# begin transaction isolation level repeatable read;
BEGIN
postgres=*# lock table demo in exclusive mode;
LOCK TABLE
postgres=*# select * from demo limit 3;
n
--------
791905
791906
791907
(3 rows)
postgres=*# select * from demo limit 3;
n
--------
795521
795522
795523
(3 rows)
postgres=*# select * from demo limit 3;
n
--------
799137
799138
799139
(3 rows)
postgres=*# commit;
COMMIT
An SQL statement without an ORDER BY clause does not guarantee any specific output order. Typically, a sequential scan begins at the start of the table, as a heap table has no defined logical order. However, if another sequential scan is already running concurrently and filling the shared buffers, my sequential scan can piggyback on it and fetch rows from wherever that scan currently is.
Once the other scan completes at the end of the table, my sequential scan continues from the beginning of the table until it reaches its own starting point, so every row is still returned exactly once. That wrap-around doesn't show up here because the LIMIT clause returns only the first three rows fetched, and those are simply the three rows the other session was reading at that moment.
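Nothing is lost to this wrap-around; the scan still visits every row, so an aggregate over the whole table is unaffected:

postgres=*# select count(*) from demo;
  count
---------
 1000000
(1 row)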
This feature dates back to PostgreSQL 8.3 and can be disabled with synchronize_seqscans.
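The parameter is on by default; checking it from the same session before turning it off:

postgres=*# show synchronize_seqscans;
 synchronize_seqscans
----------------------
 on
(1 row)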
postgres=*# select * from demo limit 3;
n
--------
817217
817218
817219
(3 rows)
postgres=*# set synchronize_seqscans to off;
SET
postgres=*# select * from demo limit 3;
n
---
1
2
3
(3 rows)
postgres=*# select * from demo limit 3;
n
---
1
2
3
(3 rows)
postgres=*# commit;
COMMIT
This feature allows PostgreSQL to serve two queries with one sequential scan of the same table, minimizing I/O during concurrent sequential scans when the table has more pages than fit in shared_buffers. A PostgreSQL sequential scan reads pages into the buffer pool, which all sessions share, and picking up a page that another session has just read is probably faster than an I/O call.
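One way to observe the sharing is with buffer statistics. EXPLAIN (ANALYZE, BUFFERS) reports how many pages were found already in shared buffers (shared hit) versus fetched with an I/O call (read). Run it while the background loop is active and compare the counters; the exact numbers depend on shared_buffers and on how far the concurrent scan has progressed:

postgres=# explain (analyze, buffers) select count(*) from demo;

The plan output includes a line like Buffers: shared hit=… read=… under the Seq Scan node; a high hit count during a concurrent scan is the synchronization at work.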
Do you know any other databases that do this? Oracle Database does the opposite: it tries to avoid synchronization and bypasses the shared buffer pool for large full table scans (direct path reads).
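To close the loop on the TL;DR: if the result order matters, say so. With an ORDER BY, the same query is deterministic whether or not a concurrent scan is running:

postgres=# select * from demo order by n limit 3;
 n
---
 1
 2
 3
(3 rows)

PostgreSQL may still synchronize the underlying scan with a concurrent one, but the sort guarantees which three rows come back.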