Date: Fri, 24 Oct 2025 13:01:48 +0200
From: hubert depesz lubaczewski <depesz@depesz.com>
Sender: depesz@depesz.com
To: PostgreSQL General <pgsql-general@lists.postgresql.org>
Subject: Why is this query touching 4gb of buffers?
Message-ID: <aPtcnI2BZkeHjRvF@depesz.com>
Reply-To: depesz@depesz.com
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Archived-At: <https://www.postgresql.org/message-id/aPtcnI2BZkeHjRvF%40depesz.com>
Precedence: bulk

Hi,
I have weird-ish case, that I can't grok, or at least explain in
hand-wavy way.

Very simple query:

SELECT
    some_table.communication_channel_id,
    some_table.root_account_id
FROM
    some_schema.some_table
WHERE
    workflow_state = 'pending' AND
    send_at <= '2025-10-23 12:01:13';

On complex-ish table:

                                                      Table "some_schema.some_table"
             Column              │            Type             │ Collation │ Nullable │                              Default
═════════════════════════════════╪═════════════════════════════╪═══════════╪══════════╪═══════════════════════════════════════════════════════════════════
 id                              │ bigint                      │           │ not null │ nextval('some_schema.some_table_id_seq'::regclass)
 notification_id                 │ bigint                      │           │          │
 notification_policy_id          │ bigint                      │           │          │
 context_id                      │ bigint                      │           │          │
 context_type                    │ character varying(255)      │           │          │
 communication_channel_id        │ bigint                      │           │          │
 frequency                       │ character varying(255)      │           │          │
 workflow_state                  │ character varying(255)      │           │          │
 batched_at                      │ timestamp without time zone │           │          │
 created_at                      │ timestamp without time zone │           │          │
 updated_at                      │ timestamp without time zone │           │          │
 send_at                         │ timestamp without time zone │           │          │
 link                            │ text                        │           │          │
 name_of_topic                   │ text                        │           │          │
 summary                         │ text                        │           │          │
 root_account_id                 │ bigint                      │           │          │
 notification_policy_override_id │ bigint                      │           │          │
Indexes:
    "some_table_pkey" PRIMARY KEY, btree (id), tablespace "data1"
…
    "index_some_table_pending" btree (send_at) WHERE workflow_state::text = 'pending'::text, tablespace "data1"
…

Running this via explain shows:

 Index Scan using index_some_table_pending on some_table  (cost=0.43..399992.44 rows=2215063 width=16) (actual time=382.466..382.466 rows=0 loops=1)
   Index Cond: (send_at <= '2025-10-23 12:01:13'::timestamp without time zone)
   Buffers: shared hit=554347
 Planning Time: 0.082 ms
 Execution Time: 382.482 ms
(5 rows)

Specifically, I see that it returned 0 rows, and to do it, it had to "touch" 554k buffers - ~ 4GB of data.

Sizes of the relavant things:
                 oid                  │ relkind │ pg_relation_size │ pg_table_size │ pg_total_relation_size
══════════════════════════════════════╪═════════╪══════════════════╪═══════════════╪════════════════════════
 some_schema.some_table               │ r       │      15339020288 │   15376539648 │            22347988992
 some_schema.index_some_table_pending │ i       │         77127680 │      77168640 │               77168640
(2 rows)


I didn't check analyze time, so can't comment on this, but I did reindex of
this index, which shrunk it to 16973824 bytes, and now the query runs, as expected, in < 1ms:

 Index Scan using index_some_table_pending on some_table  (cost=0.43..392423.37 rows=2215272 width=16) (actual time=0.005..0.005 rows=0 loops=1)
   Index Cond: (send_at <= '2025-10-23 12:01:13'::timestamp without time zone)
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=3
 Planning Time: 0.787 ms
 Execution Time: 0.016 ms
(7 rows)

While talking with others, there have been used certain terms, like
"unbounded range", which I understand, but I still don't see why, with
the same stats, and NO rows returned by index - before index, pg has to
do stuff to 4GB of data?! Anyone could try to explain?

If it matters, this is rather old Pg: PostgreSQL 14.7, and the query was
running on streaming replica, but results on primary were basically the
same.

Best regards,

depesz