scale parallel_tuple_cost by tuple width

public inbox for [email protected]  
help / color / mirror / Atom feed

scale parallel_tuple_cost by tuple width
6+ messages / 3 participants
[nested] [flat]

* scale parallel_tuple_cost by tuple width
@ 2026-03-30 13:27  Andrew Dunstan <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Andrew Dunstan @ 2026-03-30 13:27 UTC (permalink / raw)
  To: PostgreSQL Hackers <[email protected]>


While investigating a performance issue, I found that it was extremely 
difficult to get a parallel plan in some cases due to the fixed 
parallel_tuple_cost. But this cost is not really fixed - it's going to 
be larger for larger tuples. So this proposal adjusts the cost used 
according to how large we expect the results to be. The result is that 
in the common case where, say, you're getting a group id and some 
aggregates, a parallel plan is more likely to be chosen. By contrast, 
queries that generate very wide results will be less likely to choose 
parallel plans. The formula chosen does have a fixed cost piece built 
into it, which accounts for the shm_mq_sendv() and shm_mq_receive() 
synchronization that occurs regardless of width.

The patch itself is pretty simple.

Also attached is a benchmark report that I had claude create. Its main 
result shows a speedup of about 2.7x.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Benchmark: Width-Adjusted parallel_tuple_cost
==============================================
Timestamp: 2026-03-30T12:19:00Z
Patch: 0001-Scale-parallel_tuple_cost-by-tuple-width-at-Gather-n.patch
Base commit: 01d58d7e3ff (PostgreSQL 19devel)

Hardware
--------
Architecture: aarch64 (ARM64)
CPUs:         6
RAM:          11 GB
Disk:         SSD

PostgreSQL Configuration
------------------------
shared_buffers:                  2GB
work_mem:                        32MB
max_parallel_workers_per_gather: 4
max_parallel_workers:            8
parallel_tuple_cost:             0.1 (default)
io_method:                       sync
Build flags:                     -Dcassert=false -Db_ndebug=true -Dbuildtype=debugoptimized


Overview
--------
The parallel_tuple_cost GUC applies a flat per-tuple penalty to Gather
and Gather Merge nodes regardless of how wide the tuples are.  For
queries where partial aggregate results pass through the tuple queue,
these tuples are typically very narrow (8-52 bytes), but are charged the
same 0.1/tuple as wide rows.  This overcharges narrow-tuple Gathers and
can cause the planner to reject parallel plans that are 2-3x faster.

The patch scales parallel_tuple_cost by tuple width relative to a
100-byte reference, with a 10% fixed floor for irreducible queue
synchronization overhead:

    effective_cost = parallel_tuple_cost *
        (0.10 + 0.90 * max(width, 1) / 100)

Width 12 bytes  -> factor 0.208  -> effective cost 0.021/tuple
Width 52 bytes  -> factor 0.568  -> effective cost 0.057/tuple
Width 100 bytes -> factor 1.000  -> effective cost 0.100/tuple (unchanged)
Width 148 bytes -> factor 1.432  -> effective cost 0.143/tuple


Benchmark 1: Narrow-Output Aggregate (Plan Flip: Serial -> Parallel)
--------------------------------------------------------------------

Table setup:

    CREATE TABLE bench_wide AS
    SELECT
      i AS id,
      (i % 5000000) AS group_id,
      random() * 1000 AS val1,
      random() * 1000 AS val2,
      repeat('x', 200) AS padding
    FROM generate_series(1, 50000000) i;
    VACUUM ANALYZE bench_wide;

  50M rows, 12 GB on disk.
  5 columns: id int4 (4 bytes), group_id int4 (4 bytes),
             val1 float8 (8 bytes), val2 float8 (8 bytes),
             padding text (avg 204 bytes).
  5M distinct group_id values (10 rows per group).
  Source rows are wide (avg ~228 bytes) but the aggregate output is
  narrow: group_id + 3 aggregate accumulators = width 52 at Gather.

Query:

    SELECT group_id, count(*), sum(val1), avg(val2)
    FROM bench_wide
    GROUP BY group_id
    ORDER BY count(*) DESC
    LIMIT 10;

With 4 workers and 5M groups, this produces ~22.5M partial aggregate
rows (width 52) through Gather Merge.  The Gather Merge cost
contribution is the decisive factor:

  Unpatched: 0.1 * 22.5M         = 2,250,000
  Patched:   0.1 * 0.568 * 22.5M = 1,278,000  (43% less)

Results:

  UNPATCHED — planner chooses serial despite parallel being available:

    Limit  (cost=6734041..6734041 rows=10 width=28)
      ->  Sort  (cost=6734041..6748078 rows=5614666 width=28)
            ->  HashAggregate  (cost=5956600..6612710 rows=5614666 width=28)
                  Group Key: group_id
                  Planned Partitions: 32
                  ->  Seq Scan on bench_wide  (cost=0..2112919 rows=49999100 width=20)

    Execution times: 30783ms, 27386ms, 24555ms, 24826ms
    Median: ~26s

  PATCHED — planner now correctly chooses parallel:

    Limit  (cost=5753518..5753518 rows=10 width=28)
      ->  Sort  (cost=5753518..5767555 rows=5614666 width=28)
            ->  Finalize GroupAggregate  (cost=3468705..5632187 rows=5614666 width=28)
                  ->  Gather Merge  (cost=3468705..5337417 rows=22458664 width=52)
                        Workers Planned: 4
                        ->  Partial GroupAggregate  (cost=3467705..3680099 rows=5614666 width=52)
                              ->  Sort  (cost=3467705..3498955 rows=12499775 width=20)
                                    ->  Parallel Seq Scan on bench_wide  (cost=0..1737926 rows=12499775 width=20)

    Execution times: 9197ms, 9675ms, 9056ms, 11078ms
    Median: ~9.4s

  Verification — unpatched binary, parallel forced with parallel_tuple_cost=0.001:

    Same parallel plan structure as patched.
    Execution times: 9023ms, 9646ms, 9548ms, 11196ms
    Median: ~9.6s

  Summary:
    Unpatched (serial, planner's choice):   ~26s
    Patched   (parallel, planner's choice):  ~9.4s
    Speedup: 2.7x

    The parallel plan is genuinely faster.  The unpatched planner refused
    to pick it because the flat 0.1/tuple * 22.5M rows = 2.25M Gather
    cost made the parallel total (5.75M) appear close to the serial total
    (6.73M), and the serial plan avoided the Finalize GroupAggregate
    overhead.  With width adjustment, the Gather cost drops to 1.28M,
    making the parallel plan clearly cheaper.


Benchmark 2: Wide-Output Aggregate (No Regression)
---------------------------------------------------

Table setup:

    CREATE TABLE bench_narrow AS
    SELECT
      i AS id,
      (i % 500000) AS group_id,
      (random() * 1000)::numeric(10,2) AS val1,
      (random() * 1000)::numeric(10,2) AS val2,
      (random() * 1000)::numeric(10,2) AS val3,
      (random() * 1000)::numeric(10,2) AS val4,
      (random() * 1000)::numeric(10,2) AS val5,
      (random() * 1000)::numeric(10,2) AS val6,
      (random() * 1000)::numeric(10,2) AS val7,
      (random() * 1000)::numeric(10,2) AS val8
    FROM generate_series(1, 20000000) i;
    VACUUM ANALYZE bench_narrow;

  20M rows, 1776 MB on disk.
  10 columns: id int4 (4 bytes), group_id int4 (4 bytes),
              val1..val8 numeric(10,2) (avg 6 bytes each).
  500K distinct group_id values (40 rows per group).
  Source rows are narrow (avg ~52 bytes) but the aggregate output is
  wide: group_id + count + 4 sums + 4 avgs (each avg expands to
  sum + count internally) = width 268 at Gather Merge.

Query:

    SELECT group_id,
           count(*), sum(val1), sum(val2), sum(val3), sum(val4),
           avg(val5), avg(val6), avg(val7), avg(val8)
    FROM bench_narrow
    GROUP BY group_id
    ORDER BY count(*) DESC
    LIMIT 10;

With 4 workers and 500K groups, this produces ~2M partial aggregate
rows (width 268) through Gather Merge.  The width-adjusted cost
correctly charges MORE for these wide tuples:

  Unpatched Gather Merge: cost 1,372,038  (0.1 * 2.08M = 208K contribution)
  Patched   Gather Merge: cost 1,702,787  (0.1 * 2.412 * 2.08M = 502K contribution)

Both patched and unpatched choose the same parallel plan.

Results:

  UNPATCHED:

    Limit  (cost=1492668..1492668 rows=10 width=268)
      ->  Sort  (cost=1492668..1493970 rows=520832 width=268)
            ->  Finalize GroupAggregate  (cost=1122591..1481413 rows=520832 width=268)
                  ->  Gather Merge  (cost=1122591..1372038 rows=2083328 width=268)
                        Workers Planned: 4
                        ->  Sort  (cost=1121591..1122893 rows=520832 width=268)
                              ->  Partial HashAggregate  (cost=892982..1006267 rows=520832 width=268)
                                    ->  Parallel Seq Scan on bench_narrow  (cost=0..277330 rows=5000216 width=52)

    Execution times: 6525ms, 6575ms, 6447ms
    Median: ~6.5s

  PATCHED:

    Limit  (cost=1823417..1823417 rows=10 width=268)
      ->  Sort  (cost=1823417..1824719 rows=520832 width=268)
            ->  Finalize GroupAggregate  (cost=1122591..1812162 rows=520832 width=268)
                  ->  Gather Merge  (cost=1122591..1702787 rows=2083328 width=268)
                        Workers Planned: 4
                        ->  Sort  (cost=1121591..1122893 rows=520832 width=268)
                              ->  Partial HashAggregate  (cost=892982..1006267 rows=520832 width=268)
                                    ->  Parallel Seq Scan on bench_narrow  (cost=0..277330 rows=5000216 width=52)

    Execution times: 6784ms, 6869ms, 7047ms
    Median: ~6.9s

  Summary:
    Same plan on both.  Patched estimated cost is 22% higher (1.82M vs
    1.49M) because it correctly charges 2.41x the base rate for width-268
    tuples.  Execution times are within noise — the higher cost estimate
    does not cause a regression to serial.


Attachments:

  [text/x-patch] 0001-Scale-parallel_tuple_cost-by-tuple-width-at-Gather-n.patch (6.1K, 2-0001-Scale-parallel_tuple_cost-by-tuple-width-at-Gather-n.patch)
  download | inline diff:
From 4c944d4e7f4bbae7bdccd1949074e414a7b56b2e Mon Sep 17 00:00:00 2001
From: Andrew Dunstan <[email protected]>
Date: Mon, 30 Mar 2026 08:00:23 -0400
Subject: Scale parallel_tuple_cost by tuple width at Gather nodes

The parallel_tuple_cost GUC applies a flat per-tuple penalty to all
Gather and Gather Merge nodes, regardless of how wide or narrow the
tuples passing through the shared-memory queue actually are.  This
overcharges for narrow tuples (such as partial aggregate results with
a few integer columns) and undercharges for wide tuples.

The physical cost of the tuple queue is dominated by memcpy, which is
proportional to tuple width.  Introduce a width-based scaling factor
so that parallel_tuple_cost represents the cost at a reference width
of 100 bytes, with a 10% fixed floor for irreducible per-tuple queue
synchronization overhead.

For a Gather passing 12-byte partial aggregate tuples, the effective
per-tuple cost drops from 0.1 to ~0.02, which lets the planner choose
parallel plans for aggregation-heavy queries.

Tuples at the reference width (100 bytes) cost the same as before.
---
 src/backend/optimizer/path/costsize.c | 20 ++++++++++++--------
 src/backend/optimizer/plan/planner.c  |  4 +++-
 src/include/optimizer/cost.h          | 24 ++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1c575e56ff6..695cded910a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,7 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
- *	parallel_tuple_cost Cost of CPU time to pass a tuple from worker to leader backend
+ *	parallel_tuple_cost Cost of CPU time to pass a tuple from worker to leader
+ *						backend.  Scaled by tuple width relative to a reference
+ *						width (see width_adjusted_parallel_tuple_cost).
  *	parallel_setup_cost Cost of setting up shared memory for parallelism
  *
  * We expect that the kernel will typically do some amount of read-ahead
@@ -446,9 +448,10 @@ cost_gather(GatherPath *path, PlannerInfo *root,

 	run_cost = path->subpath->total_cost - path->subpath->startup_cost;

-	/* Parallel setup and communication cost. */
+	/* Parallel setup and communication cost, scaled by tuple width. */
 	startup_cost += parallel_setup_cost;
-	run_cost += parallel_tuple_cost * path->path.rows;
+	run_cost += width_adjusted_parallel_tuple_cost(path->path.pathtarget->width) *
+		path->path.rows;

 	path->path.disabled_nodes = path->subpath->disabled_nodes
 		+ ((rel->pgs_mask & PGS_GATHER) != 0 ? 0 : 1);
@@ -509,13 +512,14 @@ cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
 	run_cost += cpu_operator_cost * path->path.rows;

 	/*
-	 * Parallel setup and communication cost.  Since Gather Merge, unlike
-	 * Gather, requires us to block until a tuple is available from every
-	 * worker, we bump the IPC cost up a little bit as compared with Gather.
-	 * For lack of a better idea, charge an extra 5%.
+	 * Parallel setup and communication cost, scaled by tuple width.  Since
+	 * Gather Merge, unlike Gather, requires us to block until a tuple is
+	 * available from every worker, we bump the IPC cost up a little bit as
+	 * compared with Gather.  For lack of a better idea, charge an extra 5%.
 	 */
 	startup_cost += parallel_setup_cost;
-	run_cost += parallel_tuple_cost * path->path.rows * 1.05;
+	run_cost += width_adjusted_parallel_tuple_cost(path->path.pathtarget->width) *
+		path->path.rows * 1.05;

 	path->path.disabled_nodes = path->subpath->disabled_nodes
 		+ ((rel->pgs_mask & PGS_GATHER_MERGE) != 0 ? 0 : 1);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d19800ad6a5..88d03ecfb4d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -580,7 +580,9 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 		gather->plan.startup_cost = top_plan->startup_cost +
 			parallel_setup_cost;
 		gather->plan.total_cost = top_plan->total_cost +
-			parallel_setup_cost + parallel_tuple_cost * top_plan->plan_rows;
+			parallel_setup_cost +
+			width_adjusted_parallel_tuple_cost(top_plan->plan_width) *
+			top_plan->plan_rows;
 		gather->plan.plan_rows = top_plan->plan_rows;
 		gather->plan.plan_width = top_plan->plan_width;
 		gather->plan.parallel_aware = false;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index f2fd5d31507..d7997779b3e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -175,6 +175,30 @@ extern void initial_cost_hashjoin(PlannerInfo *root,
 extern void final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 								JoinCostWorkspace *workspace,
 								JoinPathExtraData *extra);
+
+/*
+ * Width-adjusted parallel tuple cost.
+ *
+ * The cost of passing a tuple through the shared-memory tuple queue has a
+ * fixed component (queue synchronization, slot operations) and a variable
+ * component proportional to tuple width (memcpy into/out of the ring buffer).
+ * parallel_tuple_cost is calibrated for PARALLEL_TUPLE_COST_REF_WIDTH bytes;
+ * we scale proportionally so narrow tuples (e.g. partial aggregate results)
+ * are cheaper and wide tuples are more expensive.
+ *
+ * PARALLEL_TUPLE_COST_FIXED_FRAC is the irreducible per-tuple overhead
+ * (queue synchronization) as a fraction of the total cost at the reference
+ * width.
+ */
+#define PARALLEL_TUPLE_COST_REF_WIDTH	100 /* bytes */
+#define PARALLEL_TUPLE_COST_FIXED_FRAC	0.10	/* fixed overhead fraction */
+
+#define width_adjusted_parallel_tuple_cost(width) \
+	(parallel_tuple_cost * \
+	 (PARALLEL_TUPLE_COST_FIXED_FRAC + \
+	  (1.0 - PARALLEL_TUPLE_COST_FIXED_FRAC) * \
+	  (double) Max((width), 1) / PARALLEL_TUPLE_COST_REF_WIDTH))
+
 extern void cost_gather(GatherPath *path, PlannerInfo *root,
 						RelOptInfo *rel, ParamPathInfo *param_info, double *rows);
 extern void cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
--
2.43.0


  [text/plain] parallel-tuple-cost-benchmark-2026-03-30.txt (8.1K, 3-parallel-tuple-cost-benchmark-2026-03-30.txt)
  download | inline:
Benchmark: Width-Adjusted parallel_tuple_cost
==============================================
Timestamp: 2026-03-30T12:19:00Z
Patch: 0001-Scale-parallel_tuple_cost-by-tuple-width-at-Gather-n.patch
Base commit: 01d58d7e3ff (PostgreSQL 19devel)

Hardware
--------
Architecture: aarch64 (ARM64)
CPUs:         6
RAM:          11 GB
Disk:         SSD

PostgreSQL Configuration
------------------------
shared_buffers:                  2GB
work_mem:                        32MB
max_parallel_workers_per_gather: 4
max_parallel_workers:            8
parallel_tuple_cost:             0.1 (default)
io_method:                       sync
Build flags:                     -Dcassert=false -Db_ndebug=true -Dbuildtype=debugoptimized


Overview
--------
The parallel_tuple_cost GUC applies a flat per-tuple penalty to Gather
and Gather Merge nodes regardless of how wide the tuples are.  For
queries where partial aggregate results pass through the tuple queue,
these tuples are typically very narrow (8-52 bytes), but are charged the
same 0.1/tuple as wide rows.  This overcharges narrow-tuple Gathers and
can cause the planner to reject parallel plans that are 2-3x faster.

The patch scales parallel_tuple_cost by tuple width relative to a
100-byte reference, with a 10% fixed floor for irreducible queue
synchronization overhead:

    effective_cost = parallel_tuple_cost *
        (0.10 + 0.90 * max(width, 1) / 100)

Width 12 bytes  -> factor 0.208  -> effective cost 0.021/tuple
Width 52 bytes  -> factor 0.568  -> effective cost 0.057/tuple
Width 100 bytes -> factor 1.000  -> effective cost 0.100/tuple (unchanged)
Width 148 bytes -> factor 1.432  -> effective cost 0.143/tuple


Benchmark 1: Narrow-Output Aggregate (Plan Flip: Serial -> Parallel)
--------------------------------------------------------------------

Table setup:

    CREATE TABLE bench_wide AS
    SELECT
      i AS id,
      (i % 5000000) AS group_id,
      random() * 1000 AS val1,
      random() * 1000 AS val2,
      repeat('x', 200) AS padding
    FROM generate_series(1, 50000000) i;
    VACUUM ANALYZE bench_wide;

  50M rows, 12 GB on disk.
  5 columns: id int4 (4 bytes), group_id int4 (4 bytes),
             val1 float8 (8 bytes), val2 float8 (8 bytes),
             padding text (avg 204 bytes).
  5M distinct group_id values (10 rows per group).
  Source rows are wide (avg ~228 bytes) but the aggregate output is
  narrow: group_id + 3 aggregate accumulators = width 52 at Gather.

Query:

    SELECT group_id, count(*), sum(val1), avg(val2)
    FROM bench_wide
    GROUP BY group_id
    ORDER BY count(*) DESC
    LIMIT 10;

With 4 workers and 5M groups, this produces ~22.5M partial aggregate
rows (width 52) through Gather Merge.  The Gather Merge cost
contribution is the decisive factor:

  Unpatched: 0.1 * 22.5M         = 2,250,000
  Patched:   0.1 * 0.568 * 22.5M = 1,278,000  (43% less)

Results:

  UNPATCHED — planner chooses serial despite parallel being available:

    Limit  (cost=6734041..6734041 rows=10 width=28)
      ->  Sort  (cost=6734041..6748078 rows=5614666 width=28)
            ->  HashAggregate  (cost=5956600..6612710 rows=5614666 width=28)
                  Group Key: group_id
                  Planned Partitions: 32
                  ->  Seq Scan on bench_wide  (cost=0..2112919 rows=49999100 width=20)

    Execution times: 30783ms, 27386ms, 24555ms, 24826ms
    Median: ~26s

  PATCHED — planner now correctly chooses parallel:

    Limit  (cost=5753518..5753518 rows=10 width=28)
      ->  Sort  (cost=5753518..5767555 rows=5614666 width=28)
            ->  Finalize GroupAggregate  (cost=3468705..5632187 rows=5614666 width=28)
                  ->  Gather Merge  (cost=3468705..5337417 rows=22458664 width=52)
                        Workers Planned: 4
                        ->  Partial GroupAggregate  (cost=3467705..3680099 rows=5614666 width=52)
                              ->  Sort  (cost=3467705..3498955 rows=12499775 width=20)
                                    ->  Parallel Seq Scan on bench_wide  (cost=0..1737926 rows=12499775 width=20)

    Execution times: 9197ms, 9675ms, 9056ms, 11078ms
    Median: ~9.4s

  Verification — unpatched binary, parallel forced with parallel_tuple_cost=0.001:

    Same parallel plan structure as patched.
    Execution times: 9023ms, 9646ms, 9548ms, 11196ms
    Median: ~9.6s

  Summary:
    Unpatched (serial, planner's choice):   ~26s
    Patched   (parallel, planner's choice):  ~9.4s
    Speedup: 2.7x

    The parallel plan is genuinely faster.  The unpatched planner refused
    to pick it because the flat 0.1/tuple * 22.5M rows = 2.25M Gather
    cost made the parallel total (5.75M) appear close to the serial total
    (6.73M), and the serial plan avoided the Finalize GroupAggregate
    overhead.  With width adjustment, the Gather cost drops to 1.28M,
    making the parallel plan clearly cheaper.


Benchmark 2: Wide-Output Aggregate (No Regression)
---------------------------------------------------

Table setup:

    CREATE TABLE bench_narrow AS
    SELECT
      i AS id,
      (i % 500000) AS group_id,
      (random() * 1000)::numeric(10,2) AS val1,
      (random() * 1000)::numeric(10,2) AS val2,
      (random() * 1000)::numeric(10,2) AS val3,
      (random() * 1000)::numeric(10,2) AS val4,
      (random() * 1000)::numeric(10,2) AS val5,
      (random() * 1000)::numeric(10,2) AS val6,
      (random() * 1000)::numeric(10,2) AS val7,
      (random() * 1000)::numeric(10,2) AS val8
    FROM generate_series(1, 20000000) i;
    VACUUM ANALYZE bench_narrow;

  20M rows, 1776 MB on disk.
  10 columns: id int4 (4 bytes), group_id int4 (4 bytes),
              val1..val8 numeric(10,2) (avg 6 bytes each).
  500K distinct group_id values (40 rows per group).
  Source rows are narrow (avg ~52 bytes) but the aggregate output is
  wide: group_id + count + 4 sums + 4 avgs (each avg expands to
  sum + count internally) = width 268 at Gather Merge.

Query:

    SELECT group_id,
           count(*), sum(val1), sum(val2), sum(val3), sum(val4),
           avg(val5), avg(val6), avg(val7), avg(val8)
    FROM bench_narrow
    GROUP BY group_id
    ORDER BY count(*) DESC
    LIMIT 10;

With 4 workers and 500K groups, this produces ~2M partial aggregate
rows (width 268) through Gather Merge.  The width-adjusted cost
correctly charges MORE for these wide tuples:

  Unpatched Gather Merge: cost 1,372,038  (0.1 * 2.08M = 208K contribution)
  Patched   Gather Merge: cost 1,702,787  (0.1 * 2.412 * 2.08M = 502K contribution)

Both patched and unpatched choose the same parallel plan.

Results:

  UNPATCHED:

    Limit  (cost=1492668..1492668 rows=10 width=268)
      ->  Sort  (cost=1492668..1493970 rows=520832 width=268)
            ->  Finalize GroupAggregate  (cost=1122591..1481413 rows=520832 width=268)
                  ->  Gather Merge  (cost=1122591..1372038 rows=2083328 width=268)
                        Workers Planned: 4
                        ->  Sort  (cost=1121591..1122893 rows=520832 width=268)
                              ->  Partial HashAggregate  (cost=892982..1006267 rows=520832 width=268)
                                    ->  Parallel Seq Scan on bench_narrow  (cost=0..277330 rows=5000216 width=52)

    Execution times: 6525ms, 6575ms, 6447ms
    Median: ~6.5s

  PATCHED:

    Limit  (cost=1823417..1823417 rows=10 width=268)
      ->  Sort  (cost=1823417..1824719 rows=520832 width=268)
            ->  Finalize GroupAggregate  (cost=1122591..1812162 rows=520832 width=268)
                  ->  Gather Merge  (cost=1122591..1702787 rows=2083328 width=268)
                        Workers Planned: 4
                        ->  Sort  (cost=1121591..1122893 rows=520832 width=268)
                              ->  Partial HashAggregate  (cost=892982..1006267 rows=520832 width=268)
                                    ->  Parallel Seq Scan on bench_narrow  (cost=0..277330 rows=5000216 width=52)

    Execution times: 6784ms, 6869ms, 7047ms
    Median: ~6.9s

  Summary:
    Same plan on both.  Patched estimated cost is 22% higher (1.82M vs
    1.49M) because it correctly charges 2.41x the base rate for width-268
    tuples.  Execution times are within noise — the higher cost estimate
    does not cause a regression to serial.

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: scale parallel_tuple_cost by tuple width
@ 2026-03-30 22:51  David Rowley <[email protected]>
  parent: Andrew Dunstan <[email protected]>
  0 siblings, 2 replies; 6+ messages in thread

From: David Rowley @ 2026-03-30 22:51 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Andrew Dunstan <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, 31 Mar 2026 at 03:17, Tom Lane <[email protected]> wrote:
>
> Andrew Dunstan <[email protected]> writes:
> > While investigating a performance issue, I found that it was extremely
> > difficult to get a parallel plan in some cases due to the fixed
> > parallel_tuple_cost. But this cost is not really fixed - it's going to
> > be larger for larger tuples. So this proposal adjusts the cost used
> > according to how large we expect the results to be.
>
> Unfortunately, I'm afraid that this is going to produce mostly
> "garbage in, garbage out" estimates, because our opinion of how wide
> tuples-in-flight are is pretty shaky.  (See get_expr_width and
> particularly get_typavgwidth, and note that we only have good
> statistics-based numbers for plain Vars not function outputs.)
> I agree that it could be useful to have some kind of adjustment here,
> but I fear that linear scaling is putting way too much faith in the
> quality of the data.

(I suspect you're saying this because of the "Benchmark 2" in the text
file, which contains aggregates which return a varlena type, of which
we won't estimate the width very well for...)

Sure, it's certainly true that there are cases where we don't get the
width estimate right, but that's not stopped us using them before. So
why is this case so much more critical?  We now also have GROUP BY
before join abilities in the planner, which I suspect must also be
putting trust into the very same thing. Also, varlena-returning
Aggrefs aren't going to be the Gather/GatherMerge targetlist, so why
avoid making improvements in this area because we're not great at one
of the things that could be in the targetlist?

For the patch and the analysis: This reminds me of [1], where some
reverse-engineering of costs from query run-times was done, which
ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to. To
get that for this case would require various tests with different
tuple widths and ensuring that the costs scale linearly with the
run-time of the query with the patched version. Of course, the test
query would have to have perfect width estimates, but that could be
easy enough to do by populating a text typed GROUP BY column and
populating that with all the same width of text for each of the tests
before increasing the width for the next test, using a fixed-width
aggregate each time, e.g count(*). The "#define
PARALLEL_TUPLE_COST_REF_WIDTH 100" does seem to be quite a round
number. It would be good to know how close this is to reality.
Ideally, it would be good to see results from an Apple M<something>
chip and recent x86. In my experience, these produce very different
performance results, so it might be nice to find a value that is
somewhere in the middle of what we get from those machines. I suspect
having the GROUP BY column with text widths from 8 to 1024, increasing
in powers of two would be enough data points.

David

[1] https://postgr.es/m/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: scale parallel_tuple_cost by tuple width
@ 2026-03-30 23:31  David Rowley <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 0 replies; 6+ messages in thread

From: David Rowley @ 2026-03-30 23:31 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Andrew Dunstan <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, 31 Mar 2026 at 12:00, Tom Lane <[email protected]> wrote:
> What I'm concerned about is that the estimated cost's dependency on
> tuple width may be much stronger here than it has been in other uses.
> That impression might be false, of course.

I think it's good to be concerned, but I think this is far from the
worst place to put trust in the width estimates. We also use them in
Memoize, and if we underestimate there, then we might end up with a
Nested Loop -> Memoize plan instead of a Hash or Merge Join. If the
actual Memoize cache hit ratio ends up much worse than expected due to
wider-than-expected tuples, then the chosen plan might be well off
being the optimal one. The execution costs of running a poorly chosen
Nested Loop with a poorly caching Memoize can become quadratic. I
think the parallel vs non-parallel problem is much more linear.

I'm more concerned about the opposite problem of being too liberal and
choosing parallel plans too often, resulting in worker exhaustion and
poorer performance as a result of serially executing parallel plans. I
suppose people could fix that by bumping up the parallel_setup_cost so
that the planner favours reserving parallel workers for plans that get
much bigger benefits from parallelisation.

David

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: scale parallel_tuple_cost by tuple width
@ 2026-04-01 11:15  Andrew Dunstan <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 1 reply; 6+ messages in thread

From: Andrew Dunstan @ 2026-04-01 11:15 UTC (permalink / raw)
  To: David Rowley <[email protected]>; Tom Lane <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>


On 2026-03-30 Mo 6:51 PM, David Rowley wrote:
> On Tue, 31 Mar 2026 at 03:17, Tom Lane <[email protected]> wrote:
>> Andrew Dunstan <[email protected]> writes:
>>> While investigating a performance issue, I found that it was extremely
>>> difficult to get a parallel plan in some cases due to the fixed
>>> parallel_tuple_cost. But this cost is not really fixed - it's going to
>>> be larger for larger tuples. So this proposal adjusts the cost used
>>> according to how large we expect the results to be.
>> Unfortunately, I'm afraid that this is going to produce mostly
>> "garbage in, garbage out" estimates, because our opinion of how wide
>> tuples-in-flight are is pretty shaky.  (See get_expr_width and
>> particularly get_typavgwidth, and note that we only have good
>> statistics-based numbers for plain Vars not function outputs.)
>> I agree that it could be useful to have some kind of adjustment here,
>> but I fear that linear scaling is putting way too much faith in the
>> quality of the data.
> (I suspect you're saying this because of the "Benchmark 2" in the text
> file, which contains aggregates which return a varlena type, of which
> we won't estimate the width very well for...)
>
> Sure, it's certainly true that there are cases where we don't get the
> width estimate right, but that's not stopped us using them before. So
> why is this case so much more critical?  We now also have GROUP BY
> before join abilities in the planner, which I suspect must also be
> putting trust into the very same thing. Also, varlena-returning
> Aggrefs aren't going to be the Gather/GatherMerge targetlist, so why
> avoid making improvements in this area because we're not great at one
> of the things that could be in the targetlist?
>
> For the patch and the analysis: This reminds me of [1], where some
> reverse-engineering of costs from query run-times was done, which
> ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to. To
> get that for this case would require various tests with different
> tuple widths and ensuring that the costs scale linearly with the
> run-time of the query with the patched version. Of course, the test
> query would have to have perfect width estimates, but that could be
> easy enough to do by populating a text typed GROUP BY column and
> populating that with all the same width of text for each of the tests
> before increasing the width for the next test, using a fixed-width
> aggregate each time, e.g count(*). The "#define
> PARALLEL_TUPLE_COST_REF_WIDTH 100" does seem to be quite a round
> number. It would be good to know how close this is to reality.
> Ideally, it would be good to see results from an Apple M<something>
> chip and recent x86. In my experience, these produce very different
> performance results, so it might be nice to find a value that is
> somewhere in the middle of what we get from those machines. I suspect
> having the GROUP BY column with text widths from 8 to 1024, increasing
> in powers of two would be enough data points.


I followed your suggested methodology to measure how Gather IPC
cost actually scales with tuple width.

Setup: 10M rows, 100K distinct text values per table, text column
padded to a fixed width with lpad().  Query: SELECT txt, count(*)
FROM ptc_bench_W GROUP BY txt.  This produces Partial HashAggregate
in workers, then Gather passes ~240K partial-aggregate tuples whose
width is dominated by the text column.  2 workers, work_mem=256MB,
cassert=off, debugoptimized build, aarch64 Linux.

I tested widths from 8 to 1024 bytes (10 data points).  For each
width, I ran 5 iterations of both parallel and serial execution,
and computed the Gather overhead as:

   overhead = T_parallel - T_serial / 3

This isolates the IPC cost: the serial time captures pure scan +
aggregate work, and dividing by 3 gives the ideal parallel time
(2 workers + leader).  The excess is Gather overhead.

Results (microseconds per tuple through Gather, median of 5 runs):

   Width(B)  us/tuple   Implied ptc (if ptc=0.1 at w=100)
   --------  --------   ----------------------------------
          8     0.30     0.032
         16     0.24     0.025
         32     0.77     0.083
         64     0.72     0.078
        128     1.03     0.111
        256     1.62     0.175
        384     2.90     0.313
        512     3.21     0.346
        768     4.12     0.445
       1024     5.56     0.600

The best-fit models:

   Linear:    cost(w) = 0.42 + 0.0051 * w     R² = 0.983
   Power law: cost(w) = 0.061 * w^0.63         R² = 0.966

Linear fits best.

One notable finding: at the proposed reference width of 100 bytes,
the total predicted cost is 0.42 + 0.51 = 0.93 us/tuple, of
which 0.42 is fixed.
The original patch used PARALLEL_TUPLE_COST_FIXED_FRAC = 0.10,
which substantially underestimates the width-independent component.
A higher fixed fraction would dampen the width adjustment, which
also partly addresses Tom's concern about sensitivity to width
estimate errors: with ~45% of the cost being fixed, even a 2x
error in width only translates to a ~1.5x error in total cost.

The script used to get the timings is attached.

cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Attachments:

  [application/x-shellscript] ptc_calibrate.sh (5.2K, 2-ptc_calibrate.sh)
  download

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: scale parallel_tuple_cost by tuple width
@ 2026-04-01 18:02  Tom Lane <[email protected]>
  parent: Andrew Dunstan <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Tom Lane @ 2026-04-01 18:02 UTC (permalink / raw)
  To: Andrew Dunstan <[email protected]>; +Cc: David Rowley <[email protected]>; PostgreSQL Hackers <[email protected]>

Andrew Dunstan <[email protected]> writes:
> I followed your suggested methodology to measure how Gather IPC
> cost actually scales with tuple width.

I ran your test script on two of my development machines and got:

Linux/x86_64:

Width    Parallel(ms)   Serial(ms)      Speedup  Gather rows
-----    ------------   ----------      -------  -----------
8             510.976     1219.453        2.39x       235706
16            532.123     1271.692        2.39x       235603
32            588.826     1356.428        2.30x       242062
64            674.485     1570.758        2.33x       239561
128           817.417     1887.202        2.31x       243158
256          1066.836     2548.100        2.39x       242304
384          1293.900     3038.905        2.35x       243941
512          1515.822     3573.144        2.36x       239064
768          1998.638     4725.448        2.36x       247558
1024         9865.460    22779.795        2.31x     10000000

macOS/M4-Pro:

Width    Parallel(ms)   Serial(ms)      Speedup  Gather rows
-----    ------------   ----------      -------  -----------
8             299.464      769.130        2.57x       242549
16            310.361      787.629        2.54x       243643
32            344.541      839.589        2.44x       242419
64            413.330      967.512        2.34x       238771
128           519.794     1185.757        2.28x       241440
256          1479.766     1823.559        1.23x       238615
384          2022.882     2326.823        1.15x       240617
512          2423.938     2778.995        1.15x       244752
768          3511.425     3934.384        1.12x       235814
1024         9905.073    12214.577        1.23x     10000000

It's not entirely clear to me how you reduced these numbers
to a ptc formula, but we should do that and see how the
results compare to your machine.

> The original patch used PARALLEL_TUPLE_COST_FIXED_FRAC = 0.10,
> which substantially underestimates the width-independent component.
> A higher fixed fraction would dampen the width adjustment, which
> also partly addresses Tom's concern about sensitivity to width
> estimate errors: with ~45% of the cost being fixed, even a 2x
> error in width only translates to a ~1.5x error in total cost.

That does make me feel better, assuming that we come out with
similar results on several machines.

> The script used to get the timings is attached.

If anyone else wants to try this with a platform having non-GNU
grep, you'll need these changes:

$ diff -pud ptc_calibrate.sh~ ptc_calibrate.sh
--- ptc_calibrate.sh~   2026-04-01 10:02:53.000000000 -0400
+++ ptc_calibrate.sh    2026-04-01 13:31:06.873772739 -0400
@@ -32,7 +32,8 @@ psql_cmd() {
 
 psql_cmd_timing() {
     "$PGBIN/psql" -h /tmp -p $PORT -d "$DB" -o /dev/null -qAt \
-        -c '\timing on' "$@" 2>&1 | grep -oP 'Time: \K[\d.]+' | tail -1
+        -c '\timing on' "$@" 2>&1 | \
+       sed -n 's/.*Time: \([0-9.][0-9.]*\).*/\1/p' | tail -1
 }
 
 # Create the database if needed
@@ -114,7 +115,7 @@ for W in $WIDTHS; do
 
     # Get Gather row count from EXPLAIN
     gather_rows=$(psql_cmd -c "$SET_CMDS EXPLAIN (COSTS ON) $Q;" \
-        | grep -oP 'Gather.*rows=\K\d+' | head -1)
+        | sed -n 's/.*Gather.*rows=\([0-9][0-9]*\).*/\1/p' | head -1)
     gather_rows=${gather_rows:-"?"}
 
     # Warm up (2 runs each)


			regards, tom lane





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: scale parallel_tuple_cost by tuple width
@ 2026-04-01 19:27  Tom Lane <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: Tom Lane @ 2026-04-01 19:27 UTC (permalink / raw)
  To: Andrew Dunstan <[email protected]>; +Cc: David Rowley <[email protected]>; PostgreSQL Hackers <[email protected]>

I wrote:
> macOS/M4-Pro:

> Width    Parallel(ms)   Serial(ms)      Speedup  Gather rows
> -----    ------------   ----------      -------  -----------
> 8             299.464      769.130        2.57x       242549
> 16            310.361      787.629        2.54x       243643
> 32            344.541      839.589        2.44x       242419
> 64            413.330      967.512        2.34x       238771
> 128           519.794     1185.757        2.28x       241440
> 256          1479.766     1823.559        1.23x       238615
> 384          2022.882     2326.823        1.15x       240617
> 512          2423.938     2778.995        1.15x       244752
> 768          3511.425     3934.384        1.12x       235814
> 1024         9905.073    12214.577        1.23x     10000000

On closer review, it looks like I carelessly allowed this test
to run in parallel with a buildfarm run.  Here are numbers
with an idle machine:

Width    Parallel(ms)   Serial(ms)      Speedup  Gather rows
-----    ------------   ----------      -------  -----------
8             281.881      758.167        2.69x       242549
16            300.997      791.184        2.63x       243643
32            340.815      842.715        2.47x       242419
64            401.282      985.711        2.46x       238771
128           507.066     1183.727        2.33x       241440
256           718.008     1667.830        2.32x       238615
384          1774.601     2224.726        1.25x       240617
512          2439.593     2784.242        1.14x       244752
768          3254.088     3698.615        1.14x       235814
1024         8990.584    12176.341        1.35x     10000000

This is interesting because while the speedup ratio was
pretty insensitive to row width on the x86_64 box, that's
far from true on the Apple box.

			regards, tom lane





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

end of thread, other threads:[~2026-04-01 19:27 UTC | newest]

Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-03-30 13:27 scale parallel_tuple_cost by tuple width Andrew Dunstan <[email protected]>
2026-03-30 22:51 ` David Rowley <[email protected]>
2026-03-30 23:31   ` David Rowley <[email protected]>
2026-04-01 11:15   ` Andrew Dunstan <[email protected]>
2026-04-01 18:02     ` Tom Lane <[email protected]>
2026-04-01 19:27       ` Tom Lane <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox