Fix size estimation for parallel B-Tree scans with skip arrays

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Siddharth Kothari <[email protected]>
To: [email protected]
Cc: Vaibhav Jain <[email protected]>
Cc: Madhukar <[email protected]>
Cc: Xun Cheng <[email protected]>
Cc: [email protected]
Subject: Fix size estimation for parallel B-Tree scans with skip arrays
Date: Wed, 29 Apr 2026 12:24:21 +0530
Message-ID: <CAGCUe0Lwk3C0qdkBa+OLpYc7yXwW=pbaz8Sju4xMXEQAmyp+5g@mail.gmail.com> (raw)

Hi folks.

This commit
<https://github.com/postgres/postgres/commit/92fe23d93aa3bbbc40fca669cabc4a4d7975e327#diff-db0039b5ba...;
 introduced parallel scan skip support, however it underestimates the
required memory, causing it to write past the allocated shared memory
boundary. This can corrupt any entity using the adjacent shared memory
segment, leading to unpredictable behavior.

I reproduced the issue manually on stock postgres and raised a patch that
fixes it along with regress tests. In my repro, the issue manifested as
postgres server crashing unexpectedly.

Root cause:

In src/backend/access/nbtree/nbtree.c, the loop in
btestimateparallelscan assumes
that every index column might require a skip array and adds sizeof(int) to
the estimated size:

However, every skip array actually needs space for its slot in the
btps_arrElems array AND space to store its scan key's sk_flags. Therefore,
it requires sizeof(int) * 2.


The attached patch fixes this by allocating sizeof(int) * 2 per attribute
in btestimateparallelscan.


Please let me know your thoughts.


Thanks,

Siddharth Kothari


Attachments:

  [application/x-patch] 0001-Fix-size-estimation-for-parallel-B-Tree-scans-with-s.patch (12.5K, 3-0001-Fix-size-estimation-for-parallel-B-Tree-scans-with-s.patch)
  download | inline diff:
From c9e6673dfdfdf889b239c82a3a468a1b0172bba1 Mon Sep 17 00:00:00 2001
From: Siddharth Kothari <[email protected]>
Date: Fri, 24 Apr 2026 09:47:57 +0000
Subject: [PATCH] Fix size estimation for parallel B-Tree scans with skip
 arrays

Parallel B-Tree scans with skip arrays under-estimated the shared memory required when a wide index was used. This caused memory corruption in adjacent shared memory segments, which causes crashes in different ways depending on whichever entity is consuming the adjacent memory segment.

This patch fixes the allocation size calculation in btestimateparallelscan to correctly account for the array slots needed for all index columns.
---
 src/backend/access/nbtree/nbtree.c            |   5 +-
 src/test/regress/expected/select_parallel.out | 128 ++++++++++++++++++
 src/test/regress/sql/select_parallel.sql      | 121 +++++++++++++++++
 3 files changed, 252 insertions(+), 2 deletions(-)
 mode change 100644 => 100755 src/test/regress/expected/select_parallel.out
 mode change 100644 => 100755 src/test/regress/sql/select_parallel.sql

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 6d870e4ebe7..bbbd8eaf58e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -603,9 +603,10 @@ btestimateparallelscan(Relation rel, int nkeys, int norderbys)
 		 * We make the conservative assumption that every index column will
 		 * also require a skip array.
 		 *
-		 * Every skip array must have space to store its scan key's sk_flags.
+		 * Every skip array must have space for its slot in the
+		 * btps_arrElems array, and space to store its scan key's sk_flags.
 		 */
-		estnbtreeshared = add_size(estnbtreeshared, sizeof(int));
+		estnbtreeshared = add_size(estnbtreeshared, sizeof(int) * 2);
 
 		/* Consider space required to store a datum of opclass input type */
 		attr = TupleDescCompactAttr(rel->rd_att, attnum - 1);
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
old mode 100644
new mode 100755
index 933921d1860..07ffcaef370
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -1435,3 +1435,131 @@ select parallel_workers_to_launch > :'parallel_workers_to_launch_before'  AS wrk
  t             | t
 (1 row)
 
+--
+-- Parallel Index Skip Scan Memory Corruption Test
+--
+-- The following test aims to expose a memory corruption issue in parallel index
+-- skip scans by using a wide index (32 columns) and specific query patterns
+-- that trigger skip scan logic in a parallel execution environment.
+CREATE TABLE global_inventory_1 (
+    region_id real,
+    store_type1 bigint,
+    store_type2 bigint,
+    store_type3 bigint,
+    c5 int, c6 int, c7 int, c8 int, c9 int,
+    c10 int, c11 int, c12 int, c13 int, c14 int, c15 int, c16 int, c17 int, c18 int, c19 int,
+    c20 int, c21 int, c22 int, c23 int, c24 int, c25 int, c26 int, c27 int, c28 int, c29 int,
+    c30 int, c31 int,
+    item_id timestamp with time zone,
+    payload text
+);
+CREATE TABLE global_inventory_2 (
+    region_id real,
+    store_type1 bigint,
+    store_type2 bigint,
+    store_type3 bigint,
+    c5 int, c6 int, c7 int, c8 int, c9 int,
+    c10 int, c11 int, c12 int, c13 int, c14 int, c15 int, c16 int, c17 int, c18 int, c19 int,
+    c20 int, c21 int, c22 int, c23 int, c24 int, c25 int, c26 int, c27 int, c28 int, c29 int,
+    c30 int, c31 int,
+    item_id timestamp with time zone,
+    payload text
+);
+INSERT INTO global_inventory_1 (region_id, store_type1, store_type2, store_type3, c5, c6, c7, 
+c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, 
+c27, c28, c29, c30, c31, item_id, payload)
+SELECT
+    (s * 1.1)::real,
+    (d1 * 100)::bigint,
+    (d2 * 1000)::bigint,
+    (d3 * -81985529216486896)::bigint,
+    s, s, s, s, s,
+    s, s, s, s, s, s, s, s, s, s,
+    s, s, s, s, s, s, s, s, s, s,
+    s, s,
+    t,
+    random()::text
+FROM
+    generate_series(1, 5) s,
+    generate_series(1, 4) d1,
+    generate_series(1, 4) d2,
+    generate_series(1, 1) d3,
+    generate_series('2026-01-01'::timestamptz, '2026-01-05'::timestamptz, '5 minute'::interval) t;
+INSERT INTO global_inventory_2 SELECT * FROM global_inventory_1;
+-- Create a 32-column B-tree index on table 1 to encourage skip scan
+CREATE INDEX idx_multi_skip_1 ON global_inventory_1 (region_id, store_type1, store_type2, store_type3, 
+c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, 
+c26, c27, c28, c29, c30, c31, item_id);
+CREATE INDEX idx_brin_2 ON global_inventory_2 USING brin (item_id);
+VACUUM ANALYZE global_inventory_1;
+VACUUM ANALYZE global_inventory_2;
+-- Set parameters to encourage parallel plans
+SET max_parallel_workers_per_gather = 4;
+SET min_parallel_table_scan_size = 0;
+SET min_parallel_index_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET enable_parallel_append = on;
+-- Disable sequential scan to force index usage
+SET enable_seqscan = off;
+SET random_page_cost = 1.0;
+-- Enable bitmap scan for BRIN index
+SET enable_bitmapscan = on;
+-- Create a wrapper function to filter EXPLAIN ANALYZE output for stability.
+CREATE FUNCTION parallel_skip_scan_explain_filter(text) RETURNS SETOF text
+LANGUAGE plpgsql AS
+$$
+DECLARE
+    ln text;
+BEGIN
+    FOR ln IN EXECUTE $1
+    LOOP
+        ln := regexp_replace(ln, '\m\d+(\.\d+)?\M', 'N', 'g');
+        CONTINUE WHEN (ln ~ ' +Buffers: .*');
+        CONTINUE WHEN (ln ~ ' +Planning: .*');
+        CONTINUE WHEN (ln ~ ' +Planning Time: .*');
+        CONTINUE WHEN (ln ~ ' +Execution Time: .*');
+        CONTINUE WHEN (ln ~ ' +Heap Blocks: .*');
+        RETURN NEXT ln;
+    END LOOP;
+END;
+$$;
+-- Run EXPLAIN on the test query.
+SELECT parallel_skip_scan_explain_filter('
+EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF)
+SELECT * FROM global_inventory_1 t1
+WHERE item_id IN (TIMESTAMP WITH TIME ZONE ''2026-01-01 06:03:57'', TIMESTAMP WITH TIME ZONE ''2026-01-02 06:23:28'')
+UNION ALL
+SELECT * FROM global_inventory_2 t2
+WHERE item_id IN (TIMESTAMP WITH TIME ZONE ''2026-01-01 06:03:57'', TIMESTAMP WITH TIME ZONE ''2026-01-02 06:23:28'')
+');
+                                                 parallel_skip_scan_explain_filter                                                 
+-----------------------------------------------------------------------------------------------------------------------------------
+ Gather (actual rows=N loops=N)
+   Workers Planned: N
+   Workers Launched: N
+   ->  Parallel Append (actual rows=N loops=N)
+         ->  Parallel Bitmap Heap Scan on global_inventory_2 t2 (actual rows=N loops=N)
+               Recheck Cond: (item_id = ANY ('{"Thu Jan N N:N:N N PST","Fri Jan N N:N:N N PST"}'::timestamp with time zone[]))
+               Rows Removed by Index Recheck: N
+               ->  Bitmap Index Scan on idx_brin_2 (actual rows=N loops=N)
+                     Index Cond: (item_id = ANY ('{"Thu Jan N N:N:N N PST","Fri Jan N N:N:N N PST"}'::timestamp with time zone[]))
+                     Index Searches: N
+         ->  Parallel Index Scan using idx_multi_skip_1 on global_inventory_1 t1 (actual rows=N loops=N)
+               Index Cond: (item_id = ANY ('{"Thu Jan N N:N:N N PST","Fri Jan N N:N:N N PST"}'::timestamp with time zone[]))
+               Index Searches: N
+ Planning:
+(14 rows)
+
+RESET max_parallel_workers_per_gather;
+RESET min_parallel_table_scan_size;
+RESET min_parallel_index_scan_size;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET enable_parallel_append;
+RESET enable_seqscan;
+RESET random_page_cost;
+RESET enable_bitmapscan;
+DROP TABLE global_inventory_1;
+DROP TABLE global_inventory_2;
+DROP FUNCTION parallel_skip_scan_explain_filter(text);
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
old mode 100644
new mode 100755
index 71a75bc86ea..bd57e3df76f
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -588,3 +588,124 @@ select parallel_workers_to_launch > :'parallel_workers_to_launch_before'  AS wrk
        parallel_workers_launched > :'parallel_workers_launched_before' AS wrk_launched
   from pg_stat_database
   where datname = current_database();
+
+--
+-- Parallel Index Skip Scan Memory Corruption Test
+--
+-- The following test aims to expose a memory corruption issue in parallel index
+-- skip scans by using a wide index (32 columns) and specific query patterns
+-- that trigger skip scan logic in a parallel execution environment.
+CREATE TABLE global_inventory_1 (
+    region_id real,
+    store_type1 bigint,
+    store_type2 bigint,
+    store_type3 bigint,
+    c5 int, c6 int, c7 int, c8 int, c9 int,
+    c10 int, c11 int, c12 int, c13 int, c14 int, c15 int, c16 int, c17 int, c18 int, c19 int,
+    c20 int, c21 int, c22 int, c23 int, c24 int, c25 int, c26 int, c27 int, c28 int, c29 int,
+    c30 int, c31 int,
+    item_id timestamp with time zone,
+    payload text
+);
+
+CREATE TABLE global_inventory_2 (
+    region_id real,
+    store_type1 bigint,
+    store_type2 bigint,
+    store_type3 bigint,
+    c5 int, c6 int, c7 int, c8 int, c9 int,
+    c10 int, c11 int, c12 int, c13 int, c14 int, c15 int, c16 int, c17 int, c18 int, c19 int,
+    c20 int, c21 int, c22 int, c23 int, c24 int, c25 int, c26 int, c27 int, c28 int, c29 int,
+    c30 int, c31 int,
+    item_id timestamp with time zone,
+    payload text
+);
+
+INSERT INTO global_inventory_1 (region_id, store_type1, store_type2, store_type3, c5, c6, c7, 
+c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, 
+c27, c28, c29, c30, c31, item_id, payload)
+
+SELECT
+    (s * 1.1)::real,
+    (d1 * 100)::bigint,
+    (d2 * 1000)::bigint,
+    (d3 * -81985529216486896)::bigint,
+    s, s, s, s, s,
+    s, s, s, s, s, s, s, s, s, s,
+    s, s, s, s, s, s, s, s, s, s,
+    s, s,
+    t,
+    random()::text
+FROM
+    generate_series(1, 5) s,
+    generate_series(1, 4) d1,
+    generate_series(1, 4) d2,
+    generate_series(1, 1) d3,
+    generate_series('2026-01-01'::timestamptz, '2026-01-05'::timestamptz, '5 minute'::interval) t;
+
+INSERT INTO global_inventory_2 SELECT * FROM global_inventory_1;
+
+-- Create a 32-column B-tree index on table 1 to encourage skip scan
+CREATE INDEX idx_multi_skip_1 ON global_inventory_1 (region_id, store_type1, store_type2, store_type3, 
+c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, 
+c26, c27, c28, c29, c30, c31, item_id);
+
+CREATE INDEX idx_brin_2 ON global_inventory_2 USING brin (item_id);
+VACUUM ANALYZE global_inventory_1;
+VACUUM ANALYZE global_inventory_2;
+
+-- Set parameters to encourage parallel plans
+SET max_parallel_workers_per_gather = 4;
+SET min_parallel_table_scan_size = 0;
+SET min_parallel_index_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET enable_parallel_append = on;
+-- Disable sequential scan to force index usage
+SET enable_seqscan = off;
+SET random_page_cost = 1.0;
+-- Enable bitmap scan for BRIN index
+SET enable_bitmapscan = on;
+
+-- Create a wrapper function to filter EXPLAIN ANALYZE output for stability.
+CREATE FUNCTION parallel_skip_scan_explain_filter(text) RETURNS SETOF text
+LANGUAGE plpgsql AS
+$$
+DECLARE
+    ln text;
+BEGIN
+    FOR ln IN EXECUTE $1
+    LOOP
+        ln := regexp_replace(ln, '\m\d+(\.\d+)?\M', 'N', 'g');
+        CONTINUE WHEN (ln ~ ' +Buffers: .*');
+        CONTINUE WHEN (ln ~ ' +Planning: .*');
+        CONTINUE WHEN (ln ~ ' +Planning Time: .*');
+        CONTINUE WHEN (ln ~ ' +Execution Time: .*');
+        CONTINUE WHEN (ln ~ ' +Heap Blocks: .*');
+        RETURN NEXT ln;
+    END LOOP;
+END;
+$$;
+
+-- Run EXPLAIN on the test query.
+SELECT parallel_skip_scan_explain_filter('
+EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF)
+SELECT * FROM global_inventory_1 t1
+WHERE item_id IN (TIMESTAMP WITH TIME ZONE ''2026-01-01 06:03:57'', TIMESTAMP WITH TIME ZONE ''2026-01-02 06:23:28'')
+UNION ALL
+SELECT * FROM global_inventory_2 t2
+WHERE item_id IN (TIMESTAMP WITH TIME ZONE ''2026-01-01 06:03:57'', TIMESTAMP WITH TIME ZONE ''2026-01-02 06:23:28'')
+');
+
+RESET max_parallel_workers_per_gather;
+RESET min_parallel_table_scan_size;
+RESET min_parallel_index_scan_size;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET enable_parallel_append;
+RESET enable_seqscan;
+RESET random_page_cost;
+RESET enable_bitmapscan;
+DROP TABLE global_inventory_1;
+DROP TABLE global_inventory_2;
+DROP FUNCTION parallel_skip_scan_explain_filter(text);
-- 
2.54.0.545.g6539524ca2-goog

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Fix size estimation for parallel B-Tree scans with skip arrays
  In-Reply-To: <CAGCUe0Lwk3C0qdkBa+OLpYc7yXwW=pbaz8Sju4xMXEQAmyp+5g@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox