public inbox for [email protected]  
help / color / mirror / Atom feed
From: Donghang Lin <[email protected]>
To: Tomas Vondra <[email protected]>
Cc: Dilip Kumar <[email protected]>
Cc: David Geier <[email protected]>
Cc: PostgreSQL Developers <[email protected]>
Cc: [email protected]
Cc: [email protected]
Subject: Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
Date: Sun, 24 Mar 2024 23:28:52 -0700
Message-ID: <CAA=D8a1ZGWTk0RDH-LyU==RsE-217T29eprS-NBhyj+eLHyRrQ@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<CAFiTN-v1yDvU=X+hwfJ+55=sbgDj=_kuvbduEG-F7=BjpWcnuw@mail.gmail.com>
	<[email protected]>

Hi Tomas,
>
> On Sat, Feb 17, 2024 at 2:31 PM Tomas Vondra <
[email protected]> wrote:
> Hi David,
>
> Do you plan to work continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.
>
> I think there are two things that may need an improvement:
>
> 1) Storing variable-length data in ParallelBitmapHeapState
>
> I agree with Robert the snapshot_and_stats name is not great. I see
> Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
> the reasons are somewhat different (phs_snapshot_off exists because we
> don't know which exact struct will be allocated), while here we simply
> need to allocate two variable-length pieces of memory. But it seems like
> it would work nicely for this. That is, we could try adding an offset
> for each of those pieces of memory:
>
>  - snapshot_off
>  - stats_off
>
> I don't like the GetSharedSnapshotData name very much, it seems very
> close to GetSnapshotData - quite confusing, I think.
>
> Dmitry also suggested we might add a separate piece of shared memory. I
> don't quite see how that would work for ParallelBitmapHeapState, but I
> doubt it'd be simpler than having two offsets. I don't think the extra
> complexity (paid by everyone) would be worth it just to make EXPLAIN
> ANALYZE work.

This issue is now gone after Heikki's fix.

> 2) Leader vs. worker counters
>
> It seems to me this does nothing to add the per-worker values from "Heap
> Blocks" into the leader, which means we get stuff like this:
>
>     Heap Blocks: exact=102 lossy=10995
>     Worker 0:  actual time=50.559..209.773 rows=215253 loops=1
>        Heap Blocks: exact=207 lossy=19354
>     Worker 1:  actual time=50.543..211.387 rows=162934 loops=1
>        Heap Blocks: exact=161 lossy=14636
>
> I think this is wrong / confusing, and inconsistent with what we do for
> other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
> where we *do* add the values to the leader:
>
>     Heap Blocks: exact=125 lossy=10789
>     Buffers: shared hit=11 read=45420
>     Worker 0:  actual time=51.419..221.904 rows=150437 loops=1
>       Heap Blocks: exact=136 lossy=13541
>       Buffers: shared hit=4 read=13541
>     Worker 1:  actual time=56.610..222.469 rows=229738 loops=1
>       Heap Blocks: exact=209 lossy=20655
>       Buffers: shared hit=4 read=20655
>
> Here it's not entirely obvious, because leader participates in the
> execution, but once we disable leader participation, it's clearer:
>
>     Buffers: shared hit=7 read=45421
>     Worker 0:  actual time=28.540..247.683 rows=309112 loops=1
>       Heap Blocks: exact=282 lossy=27806
>       Buffers: shared hit=4 read=28241
>     Worker 1:  actual time=24.290..251.993 rows=190815 loops=1
>       Heap Blocks: exact=188 lossy=17179
>       Buffers: shared hit=3 read=17180
>
> Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
> Blocks" simply disappeared because the leader does nothing and we don't
> print zeros.

Heap Blocks is specific to Bitmap Heap Scan. It seems that node specific
stats
do not aggregate workers' stats into leaders for some existing nodes. For
example,
Memorize node for Hits, Misses, etc

   ->  Nested Loop (actual rows=166667 loops=3)
         ->  Parallel Seq Scan on t (actual rows=33333 loops=3)
         ->  Memoize (actual rows=5 loops=100000)
               Cache Key: t.j
               Cache Mode: logical
               Hits: 32991  Misses: 5  Evictions: 0  Overflows: 0  Memory
Usage: 2kB
               Worker 0:  Hits: 33551  Misses: 5  Evictions: 0  Overflows:
0  Memory Usage: 2kB
               Worker 1:  Hits: 33443  Misses: 5  Evictions: 0  Overflows:
0  Memory Usage: 2kB
               ->  Index Scan using uj on u (actual rows=5 loops=15)
                     Index Cond: (j = t.j)

Sort, HashAggregate also do the same stuff.

> 3) I'm not sure dealing with various EXPLAIN flags may not be entirely
> correct. Consider this:
>
> EXPLAIN (ANALYZE):
>
>    ->  Parallel Bitmap Heap Scan on t  (...)
>          Recheck Cond: (a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Worker 0:  Heap Blocks: exact=168 lossy=15648
>          Worker 1:  Heap Blocks: exact=302 lossy=29337
>
> EXPLAIN (ANALYZE, VERBOSE):
>
>    ->  Parallel Bitmap Heap Scan on public.t  (...)
>          Recheck Cond: (t.a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Worker 0:  actual time=35.067..300.882 rows=282108 loops=1
>            Heap Blocks: exact=257 lossy=25358
>          Worker 1:  actual time=32.827..302.224 rows=217819 loops=1
>            Heap Blocks: exact=213 lossy=19627
>
> EXPLAIN (ANALYZE, BUFFERS):
>
>    ->  Parallel Bitmap Heap Scan on t  (...)
>          Recheck Cond: (a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Buffers: shared hit=7 read=45421
>          Worker 0:  Heap Blocks: exact=236 lossy=21870
>          Worker 1:  Heap Blocks: exact=234 lossy=23115
>
> EXPLAIN (ANALYZE, VERBOSE, BUFFERS):
>
>    ->  Parallel Bitmap Heap Scan on public.t  (...)
>          Recheck Cond: (t.a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Buffers: shared hit=7 read=45421
>          Worker 0:  actual time=28.265..260.381 rows=261264 loops=1
>            Heap Blocks: exact=260 lossy=23477
>            Buffers: shared hit=3 read=23478
>          Worker 1:  actual time=28.224..261.627 rows=238663 loops=1
>            Heap Blocks: exact=210 lossy=21508
>            Buffers: shared hit=4 read=21943
>
> Why should the per-worker buffer info be shown when combined with the
> VERBOSE flag, and not just with BUFFERS, when the patch shows the
> per-worker info always?
>

It seems that the general explain print framework requires verbose mode to
show per worker stats.
For example, how Buffers hits, JIT are printed. While in some specific
nodes which involves parallelism,
they always show worker blocks. This is why we see that some worker blocks
don't have buffers
stats in non verbose mode. There are several existing nodes have the same
issue as what this
patch does: memorize, sort, hashaggregate.

> 4) Now that I think about this, isn't the *main* problem really that we
> don't display the sum of the per-worker stats (which I think is wrong)?
> I mean, we already can get the worker details VERBOSEm right? So the
> only reason to display that by default seems to be that it the values in
> "Heap Blocks" are from the leader only.

If we print aggregate Heap Blocks in the 'leader' block and show worker
stats only in
verbose mode, does it look better? In this way, it matches better with how
the
general framework prints workers stats, but it differs from some existing
nodes which
also print worker stats. As mentioned above, they are memorize, sort, and
hashaggregate nodes.
By the way, is that also a problem for these nodes?

> BTW doesn't this also suggest some of the code added to explain.c may
> not be quite necessary? Wouldn't it be enough to just "extend" the
> existing code printing per-worker stats. (I haven't tried, so maybe I'm
> wrong and we need the new code.)

We need the new code as they are node specific stats and we do call the
worker print function
to lay out the explain plan. I think the problem is in which mode we should
show worker blocks. This
is discussed in the above section.

v3 failed on master, attached a rebased version.

Regards,
Donghang Lin
(ServiceNow)


Attachments:

  [application/octet-stream] v4-0001-Parallel-Bitmap-Heap-Scan-reports-per-worker-stat.patch (11.3K, 3-v4-0001-Parallel-Bitmap-Heap-Scan-reports-per-worker-stat.patch)
  download | inline diff:
From cadcca528fb25986fd995b71e859cd69d6f483e4 Mon Sep 17 00:00:00 2001
From: David Geier <[email protected]>
Date: Tue, 8 Nov 2022 19:40:31 +0100
Subject: [PATCH v4] Parallel Bitmap Heap Scan reports per-worker stats

Similarly to other nodes (e.g. hash join, sort, memoize),
Bitmap Heap Scan now reports per-worker stats in the EXPLAIN
ANALYZE output. Previously only the heap blocks stats for the
leader were reported which was incomplete in parallel scans.

Discussion: https://www.postgresql.org/message-id/flat/b3d80961-c2e5-38cc-6a32-61886cdf766d%40gmail.com
---
 src/backend/commands/explain.c            | 45 +++++++++--
 src/backend/executor/execParallel.c       |  3 +
 src/backend/executor/nodeBitmapHeapscan.c | 93 ++++++++++++++++++++---
 src/include/executor/nodeBitmapHeapscan.h |  1 +
 src/include/nodes/execnodes.h             | 33 +++++++-
 5 files changed, 157 insertions(+), 18 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 926d70afaf..02251994c6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3467,26 +3467,57 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 static void
 show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
 {
+	Assert(es->analyze);
+
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-							   planstate->exact_pages, es);
+							   planstate->stats.exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-							   planstate->lossy_pages, es);
+							   planstate->stats.lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (planstate->stats.exact_pages > 0 || planstate->stats.lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-				appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-				appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (planstate->stats.exact_pages > 0)
+				appendStringInfo(es->str, " exact=%ld", planstate->stats.exact_pages);
+			if (planstate->stats.lossy_pages > 0)
+				appendStringInfo(es->str, " lossy=%ld", planstate->stats.lossy_pages);
 			appendStringInfoChar(es->str, '\n');
 		}
 	}
+
+	if (planstate->pstate != NULL)
+	{
+		for (int n = 0; n < planstate->sinstrument->num_workers; n++)
+		{
+			BitmapHeapScanInstrumentation *si = &planstate->sinstrument->sinstrument[n];
+
+			if (si->exact_pages == 0 && si->lossy_pages == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "Heap Blocks: exact=%ld lossy=%ld\n",
+						 si->exact_pages, si->lossy_pages);
+			}
+			else
+			{
+				ExplainPropertyInteger("Exact Heap Blocks", NULL, si->exact_pages, es);
+				ExplainPropertyInteger("Lossy Heap Blocks", NULL, si->lossy_pages, es);
+			}
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
 }
 
 /*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 8c53d1834e..bfb3419efb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1076,6 +1076,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_MemoizeState:
 			ExecMemoizeRetrieveInstrumentation((MemoizeState *) planstate);
 			break;
+		case T_BitmapHeapScanState:
+			ExecBitmapHeapRetrieveInstrumentation((BitmapHeapScanState *) planstate);
+			break;
 		default:
 			break;
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index cee7f45aab..0d7dd1f165 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -205,9 +205,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
 			BitmapAdjustPrefetchIterator(node, tbmres);
 
 			if (tbmres->ntuples >= 0)
-				node->exact_pages++;
+				node->stats.exact_pages++;
 			else
-				node->lossy_pages++;
+				node->stats.lossy_pages++;
 
 			/*
 			 * We can skip fetching the heap page if we don't need any fields
@@ -647,6 +647,17 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
 {
 	TableScanDesc scanDesc;
 
+	/*
+	 * When ending a parallel worker, copy the statistics gathered by the
+	 * worker back into shared memory so that it can be picked up by the main
+	 * process to report in EXPLAIN ANALYZE.
+	 */
+	if (node->sinstrument != NULL && IsParallelWorker())
+	{
+		Assert(ParallelWorkerNumber <= node->sinstrument->num_workers);
+		node->sinstrument->sinstrument[ParallelWorkerNumber] = node->stats;
+	}
+
 	/*
 	 * extract information from the node
 	 */
@@ -716,8 +727,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->return_empty_tuples = 0;
 	scanstate->vmbuffer = InvalidBuffer;
 	scanstate->pvmbuffer = InvalidBuffer;
-	scanstate->exact_pages = 0;
-	scanstate->lossy_pages = 0;
+	scanstate->stats.exact_pages = 0;
+	scanstate->stats.lossy_pages = 0;
 	scanstate->prefetch_iterator = NULL;
 	scanstate->prefetch_pages = 0;
 	scanstate->prefetch_target = 0;
@@ -840,7 +851,22 @@ void
 ExecBitmapHeapEstimate(BitmapHeapScanState *node,
 					   ParallelContext *pcxt)
 {
-	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(ParallelBitmapHeapState));
+	Size		size;
+
+	/*
+	 * We store ParallelBitmapHeapState followed by
+	 * SharedBitmapHeapInstrumentationInfo in the same shmem chunk.
+	 */
+	size = MAXALIGN(sizeof(ParallelBitmapHeapState));
+
+	/* don't need this if not instrumenting or no workers */
+	if (node->ss.ps.instrument && pcxt->nworkers > 0)
+	{
+		size = add_size(size, offsetof(SharedBitmapHeapInstrumentation, sinstrument));
+		size = add_size(size, mul_size(pcxt->nworkers, sizeof(BitmapHeapScanInstrumentation)));
+	}
+
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 }
 
@@ -855,13 +881,27 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 							ParallelContext *pcxt)
 {
 	ParallelBitmapHeapState *pstate;
+	SharedBitmapHeapInstrumentation *sinstrument = NULL;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
+	char	   *ptr;
+	Size		size;
 
 	/* If there's no DSA, there are no workers; initialize nothing. */
 	if (dsa == NULL)
 		return;
 
-	pstate = shm_toc_allocate(pcxt->toc, sizeof(ParallelBitmapHeapState));
+	size = MAXALIGN(sizeof(ParallelBitmapHeapState));
+	if (node->ss.ps.instrument && pcxt->nworkers > 0)
+	{
+		size = add_size(size, offsetof(SharedBitmapHeapInstrumentation, sinstrument));
+		size = add_size(size, mul_size(pcxt->nworkers, sizeof(BitmapHeapScanInstrumentation)));
+	}
+
+	ptr = shm_toc_allocate(pcxt->toc, size);
+	pstate = (ParallelBitmapHeapState *) ptr;
+	ptr += MAXALIGN(sizeof(ParallelBitmapHeapState));
+	if (node->ss.ps.instrument && pcxt->nworkers > 0)
+		sinstrument = (SharedBitmapHeapInstrumentation *) ptr;
 
 	pstate->tbmiterator = 0;
 	pstate->prefetch_iterator = 0;
@@ -874,8 +914,16 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 
 	ConditionVariableInit(&pstate->cv);
 
+	/* ensure any unfilled slots will contain zeroes */
+	if (sinstrument)
+	{
+		sinstrument->num_workers = pcxt->nworkers;
+		memset(sinstrument->sinstrument, 0, pcxt->nworkers * sizeof(BitmapHeapScanInstrumentation));
+	}
+
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pstate);
 	node->pstate = pstate;
+	node->sinstrument = sinstrument;
 }
 
 /* ----------------------------------------------------------------
@@ -917,10 +965,37 @@ void
 ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 							   ParallelWorkerContext *pwcxt)
 {
-	ParallelBitmapHeapState *pstate;
+	char	   *ptr;
 
 	Assert(node->ss.ps.state->es_query_dsa != NULL);
 
-	pstate = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
-	node->pstate = pstate;
+	ptr = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+
+	node->pstate = (ParallelBitmapHeapState *) ptr;
+	ptr += MAXALIGN(sizeof(ParallelBitmapHeapState));
+
+	if (node->ss.ps.instrument)
+		node->sinstrument = (SharedBitmapHeapInstrumentation *) ptr;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecBitmapHeapRetrieveInstrumentation
+ *
+ *		Transfer bitmap heap scan statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBitmapHeapRetrieveInstrumentation(BitmapHeapScanState *node)
+{
+	SharedBitmapHeapInstrumentation *sinstrument = node->sinstrument;
+	Size		size;
+
+	if (sinstrument == NULL)
+		return;
+
+	size = offsetof(SharedBitmapHeapInstrumentation, sinstrument)
+		+ sinstrument->num_workers * sizeof(BitmapHeapScanInstrumentation);
+
+	node->sinstrument = palloc(size);
+	memcpy(node->sinstrument, sinstrument, size);
 }
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index ea003a9caa..446a664590 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -28,5 +28,6 @@ extern void ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 										  ParallelContext *pcxt);
 extern void ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 										   ParallelWorkerContext *pwcxt);
+extern void ExecBitmapHeapRetrieveInstrumentation(BitmapHeapScanState *node);
 
 #endif							/* NODEBITMAPHEAPSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1774c56ae3..8306e0c47b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1735,6 +1735,12 @@ typedef struct BitmapIndexScanState
 	struct IndexScanDescData *biss_ScanDesc;
 } BitmapIndexScanState;
 
+typedef struct BitmapHeapScanInstrumentation
+{
+	long lossy_pages;
+	long exact_pages;
+} BitmapHeapScanInstrumentation;
+
 /* ----------------
  *	 SharedBitmapState information
  *
@@ -1778,6 +1784,20 @@ typedef struct ParallelBitmapHeapState
 	ConditionVariable cv;
 } ParallelBitmapHeapState;
 
+/* ----------------
+ *	 Instrumentation data for a parallel bitmap heap scan.
+ *
+ * During a parallel bitmap heap scan, this lives in shared memory.  At the
+ * end, each worker copies their own stats to the right slot.  Finally, the
+ * leader copies the data to its local memory
+  * ----------------
+ */
+typedef struct SharedBitmapHeapInstrumentation
+{
+	int			num_workers;
+	BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
+} SharedBitmapHeapInstrumentation;
+
 /* ----------------
  *	 BitmapHeapScanState information
  *
@@ -1812,8 +1832,6 @@ typedef struct BitmapHeapScanState
 	int			return_empty_tuples;
 	Buffer		vmbuffer;
 	Buffer		pvmbuffer;
-	long		exact_pages;
-	long		lossy_pages;
 	TBMIterator *prefetch_iterator;
 	int			prefetch_pages;
 	int			prefetch_target;
@@ -1822,6 +1840,17 @@ typedef struct BitmapHeapScanState
 	TBMSharedIterator *shared_tbmiterator;
 	TBMSharedIterator *shared_prefetch_iterator;
 	ParallelBitmapHeapState *pstate;
+
+	/*
+	 * In a parallelized bitmap heap scan, each worker sets their
+	 * instrumentaton data in pstate->sinstrument at the end.  The leader
+	 * copies the shared-memory info back to local storage before DSM
+	 * shutdown, and stores it here in 'instrument'.  It remains NULL in
+	 * workers and in non-parallel scans.
+	 */
+	SharedBitmapHeapInstrumentation *sinstrument;
+
+	BitmapHeapScanInstrumentation stats;
 } BitmapHeapScanState;
 
 /* ----------------
-- 
2.40.1



view thread (23+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
  In-Reply-To: <CAA=D8a1ZGWTk0RDH-LyU==RsE-217T29eprS-NBhyj+eLHyRrQ@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox