Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rorbW-003moX-DF for pgsql-hackers@arkaria.postgresql.org; Mon, 25 Mar 2024 21:11:38 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rorbU-0087BB-K7 for pgsql-hackers@arkaria.postgresql.org; Mon, 25 Mar 2024 21:11:37 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rorbU-0087B3-6T for pgsql-hackers@lists.postgresql.org; Mon, 25 Mar 2024 21:11:36 +0000 Received: from mail-pf1-x42d.google.com ([2607:f8b0:4864:20::42d]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1rorbQ-006Ol7-Ig for pgsql-hackers@lists.postgresql.org; Mon, 25 Mar 2024 21:11:35 +0000 Received: by mail-pf1-x42d.google.com with SMTP id d2e1a72fcca58-6e704078860so3374071b3a.0 for ; Mon, 25 Mar 2024 14:11:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1711401092; x=1712005892; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=GcgV+eNfvXiZH6ZzTzAIez4W+LOVd+wf6VIp9gkmBMQ=; b=IfxGmhhe4I0Z5hNGxvhrIK0ysakk9RWsk0s9d1/fLuQ0NmrGiKWIwHQpXS5N9sgxdm ITfkDa3qWQ5znSEcJhKz/Xjd8KRKbrs3M+yoWDfkwZPvLR5tzY7qeWT209IuhE3bCmOH UYfkgVFVR2Owk2K/8ISle8xbHJJJTG6KHXE7yYHUBJe09nKtuIDPQORCJL2hOH+KyTe7 FcTmiZ0Ile6O9Dr0dYvn5AYDKXMPKl1B6750Ddwkqm7lnzfqX+wN1sEFMewo6eyOQ+Q0 iUCLiap3qvx+Bxt2nO/Dtfkqi7VHgYDM56VxZjtjQaxc9r2TxBwmkeOxXKP6u8HD9auC QHVg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1711401092; x=1712005892; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=GcgV+eNfvXiZH6ZzTzAIez4W+LOVd+wf6VIp9gkmBMQ=; b=HJLYvQwiQHFiMMpmS3I81LCnd0cqj6Bcz6pzu5bTUP0OfYTjcUOeI4/uj/1YBqnL5m DNZbmRDRKw+V0eHSwCzPldnOC5Np0AsLVlpSFqa+ebz4Q/teTMPEFjJfu8AwOQjGDBzk Fz8vqFLoxTzcey/M9magtqenPn1T9kZpk59RKzR47snnvG0X4mS4L7KVdqoRWKE2DCDw XrVG6IYKLcARKHDSX6AJNwXK0VQw6E4AtYaReKCSM/5ExI3aEyOpNaO5YeYbgqtVXSDk k1Rnpfnvxg19c/B4eJc0Ic4R7Vp0XutkeqjS/5C68J0AVQAou1gM5V+qIh3zDkdr2NXn azog== X-Forwarded-Encrypted: i=1; AJvYcCXPtVMg19AcQ3Fo4Yq8OB+oTgUbaCixUBsaGDj3URhoMd0A2hbbRtwnWWLGfzqWNbzk5F/kxALuojX34lwk9hmrzCJswdQ3d7bI/D9bwbxZOMRY X-Gm-Message-State: AOJu0YxpWx2aRipRw5LXJ+IyU9jFjwdfQ1cW0p51fODcKS3QcsFfBkMq tgw06ihY+YFJND1GGI2FDZr5tc1Dz8HqncAxVeP1l2QjcZkE5Osk9FDO0/RL20KlzqOodemlrLS G0SXlsfx47TbscvZEtYWwZmUmqw0= X-Google-Smtp-Source: AGHT+IFz1F7Yu79Gdh44ZGJY6hKIOOZ6nYcxcxZyn8CmtMl50sjAcDSvhd6PbXHpLRjhykDdDMuBJ2NAY4mlj//g/Gg= X-Received: by 2002:a05:6a21:a581:b0:1a1:667c:eeca with SMTP id gd1-20020a056a21a58100b001a1667ceecamr7600577pzc.3.1711401092325; Mon, 25 Mar 2024 14:11:32 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Melanie Plageman Date: Mon, 25 Mar 2024 17:11:20 -0400 Message-ID: Subject: Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE To: Donghang Lin Cc: Tomas Vondra , Dilip Kumar , David Geier , PostgreSQL Developers , hlinnaka@iki.fi Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Mon, Mar 25, 2024 at 2:29=E2=80=AFAM Donghang Lin wrote: > > > > On Sat, Feb 17, 2024 at 2:31=E2=80=AFPM Tomas Vondra wrote: > > 2) Leader vs. worker counters > > > > It seems to me this does nothing to add the per-worker values from "Hea= p > > Blocks" into the leader, which means we get stuff like this: > > > > Heap Blocks: exact=3D102 lossy=3D10995 > > Worker 0: actual time=3D50.559..209.773 rows=3D215253 loops=3D1 > > Heap Blocks: exact=3D207 lossy=3D19354 > > Worker 1: actual time=3D50.543..211.387 rows=3D162934 loops=3D1 > > Heap Blocks: exact=3D161 lossy=3D14636 > > > > I think this is wrong / confusing, and inconsistent with what we do for > > other nodes. It's also inconsistent with how we deal e.g. with BUFFERS, > > where we *do* add the values to the leader: > > > > Heap Blocks: exact=3D125 lossy=3D10789 > > Buffers: shared hit=3D11 read=3D45420 > > Worker 0: actual time=3D51.419..221.904 rows=3D150437 loops=3D1 > > Heap Blocks: exact=3D136 lossy=3D13541 > > Buffers: shared hit=3D4 read=3D13541 > > Worker 1: actual time=3D56.610..222.469 rows=3D229738 loops=3D1 > > Heap Blocks: exact=3D209 lossy=3D20655 > > Buffers: shared hit=3D4 read=3D20655 > > > > Here it's not entirely obvious, because leader participates in the > > execution, but once we disable leader participation, it's clearer: > > > > Buffers: shared hit=3D7 read=3D45421 > > Worker 0: actual time=3D28.540..247.683 rows=3D309112 loops=3D1 > > Heap Blocks: exact=3D282 lossy=3D27806 > > Buffers: shared hit=3D4 read=3D28241 > > Worker 1: actual time=3D24.290..251.993 rows=3D190815 loops=3D1 > > Heap Blocks: exact=3D188 lossy=3D17179 > > Buffers: shared hit=3D3 read=3D17180 > > > > Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap > > Blocks" simply disappeared because the leader does nothing and we don't > > print zeros. > > Heap Blocks is specific to Bitmap Heap Scan. It seems that node specific = stats > do not aggregate workers' stats into leaders for some existing nodes. For= example, > Memorize node for Hits, Misses, etc > > -> Nested Loop (actual rows=3D166667 loops=3D3) > -> Parallel Seq Scan on t (actual rows=3D33333 loops=3D3) > -> Memoize (actual rows=3D5 loops=3D100000) > Cache Key: t.j > Cache Mode: logical > Hits: 32991 Misses: 5 Evictions: 0 Overflows: 0 Memory= Usage: 2kB > Worker 0: Hits: 33551 Misses: 5 Evictions: 0 Overflows= : 0 Memory Usage: 2kB > Worker 1: Hits: 33443 Misses: 5 Evictions: 0 Overflows= : 0 Memory Usage: 2kB > -> Index Scan using uj on u (actual rows=3D5 loops=3D15) > Index Cond: (j =3D t.j) > > Sort, HashAggregate also do the same stuff. > > > 3) I'm not sure dealing with various EXPLAIN flags may not be entirely > > correct. Consider this: > > > > EXPLAIN (ANALYZE): > > > > -> Parallel Bitmap Heap Scan on t (...) > > Recheck Cond: (a < 5000) > > Rows Removed by Index Recheck: 246882 > > Worker 0: Heap Blocks: exact=3D168 lossy=3D15648 > > Worker 1: Heap Blocks: exact=3D302 lossy=3D29337 > > > > EXPLAIN (ANALYZE, VERBOSE): > > > > -> Parallel Bitmap Heap Scan on public.t (...) > > Recheck Cond: (t.a < 5000) > > Rows Removed by Index Recheck: 246882 > > Worker 0: actual time=3D35.067..300.882 rows=3D282108 loops= =3D1 > > Heap Blocks: exact=3D257 lossy=3D25358 > > Worker 1: actual time=3D32.827..302.224 rows=3D217819 loops= =3D1 > > Heap Blocks: exact=3D213 lossy=3D19627 > > > > EXPLAIN (ANALYZE, BUFFERS): > > > > -> Parallel Bitmap Heap Scan on t (...) > > Recheck Cond: (a < 5000) > > Rows Removed by Index Recheck: 246882 > > Buffers: shared hit=3D7 read=3D45421 > > Worker 0: Heap Blocks: exact=3D236 lossy=3D21870 > > Worker 1: Heap Blocks: exact=3D234 lossy=3D23115 > > > > EXPLAIN (ANALYZE, VERBOSE, BUFFERS): > > > > -> Parallel Bitmap Heap Scan on public.t (...) > > Recheck Cond: (t.a < 5000) > > Rows Removed by Index Recheck: 246882 > > Buffers: shared hit=3D7 read=3D45421 > > Worker 0: actual time=3D28.265..260.381 rows=3D261264 loops= =3D1 > > Heap Blocks: exact=3D260 lossy=3D23477 > > Buffers: shared hit=3D3 read=3D23478 > > Worker 1: actual time=3D28.224..261.627 rows=3D238663 loops= =3D1 > > Heap Blocks: exact=3D210 lossy=3D21508 > > Buffers: shared hit=3D4 read=3D21943 > > > > Why should the per-worker buffer info be shown when combined with the > > VERBOSE flag, and not just with BUFFERS, when the patch shows the > > per-worker info always? > > > > It seems that the general explain print framework requires verbose mode t= o show per worker stats. > For example, how Buffers hits, JIT are printed. While in some specific no= des which involves parallelism, > they always show worker blocks. This is why we see that some worker block= s don't have buffers > stats in non verbose mode. There are several existing nodes have the same= issue as what this > patch does: memorize, sort, hashaggregate. I don't think passing explain the BUFFERS option should impact what is shown for bitmap heap scan lossy/exact. BUFFERS has to do with buffer usage. "Heap Blocks" here is actually more accurately "Bitmap Blocks". 1) it is not heap specific, so at least changing it to "Table Blocks" would be better and 2) the meaning is about whether or not the blocks were represented lossily in the bitmap -- yes, those blocks that we are talking about are table blocks (in contrast to index blocks), but the important part is the bitmap. So, BUFFERS shouldn't cause this info to show. As for whether or not the leader number should be inclusive of all the worker numbers, if there is a combination of options in which per-worker stats are not displayed, then the leader count should be inclusive of the worker numbers. However, if the worker numbers are always displayed, I think it is okay for the leader number to only display its own count. Though I agree with Tomas that that can be confusing when parallel_leader_participation is off. You end up without a topline number. As for whether or not per-worker stats should be displayed by default or only with VERBOSE, it sounds like there are two different precedents. I don't have a strong feeling one way or the other. Whichever is most consistent. Donghang, could you list again which plan nodes and explain options always print per-worker stats and which only do with the VERBOSE option? I think there is an issue with the worker counts on rescan though. I was playing around with this patch with one of the regression test suite tables: drop table if exists tenk1; drop table if exists tenk2; CREATE TABLE tenk1 ( unique1 int4, unique2 int4, two int4, four int4, ten int4, twenty int4, hundred int4, thousand int4, twothousand int4, fivethous int4, tenthous int4, odd int4, even int4, stringu1 name, stringu2 name, string4 name ) with (autovacuum_enabled =3D false); COPY tenk1 FROM '/[source directory]/src/test/regress/data/tenk.data'; CREATE TABLE tenk2 AS SELECT * FROM tenk1; CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops); VACUUM ANALYZE tenk1; VACUUM ANALYZE tenk2; set enable_seqscan to off; set enable_indexscan to off; set enable_hashjoin to off; set enable_mergejoin to off; set enable_material to off; set parallel_setup_cost=3D0; set parallel_tuple_cost=3D0; set min_parallel_table_scan_size=3D0; set max_parallel_workers_per_gather=3D2; set parallel_leader_participation =3D off; explain (analyze, costs off, verbose) select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=3D0; I don't think the worker counts are getting carried across rescans. For this query, with parallel_leader_participation, you can see the leader has a high number of exact heap blocks (there are 30 rescans, so it is the sum across all of those). But the workers have low counts. That might be normal because maybe they did less work, but if you set parallel_leader_participation =3D off, you see the worker numbers are still small. I did some logging and I do see workers with counts of lossy/exact not making it into the final count. I haven't had time to debug more, but it is worth looking into. parallel_leader_participation =3D on ---------------------------------------------------------------------------= -------------------- Aggregate (actual time=3D62.321..63.178 rows=3D1 loops=3D1) Output: count(*) -> Nested Loop (actual time=3D2.058..55.202 rows=3D98000 loops=3D1) -> Seq Scan on public.tenk2 (actual time=3D0.182..3.699 rows=3D10= loops=3D1) Output: tenk2.unique1 ... Filter: (tenk2.thousand =3D 0) Rows Removed by Filter: 9990 -> Gather (actual time=3D1.706..4.142 rows=3D9800 loops=3D10) Workers Planned: 2 Workers Launched: 2 -> Parallel Bitmap Heap Scan on public.tenk1 (actual time=3D0.365..0.958 rows=3D3267 loops=3D30) Recheck Cond: (tenk1.hundred > 1) Heap Blocks: exact=3D1801 Worker 0: actual time=3D0.033..0.414 rows=3D1993 loop= s=3D10 Heap Blocks: exact=3D78 lossy=3D0 Worker 1: actual time=3D0.032..0.550 rows=3D2684 loop= s=3D10 Heap Blocks: exact=3D86 lossy=3D0 -> Bitmap Index Scan on tenk1_hundred (actual time=3D0.972..0.972 rows=3D9800 loops=3D10) Index Cond: (tenk1.hundred > 1) parallel_leader_participation =3D off ---------------------------------------------------------------------------= -------------------- Aggregate (actual time=3D84.502..84.977 rows=3D1 loops=3D1) Output: count(*) -> Nested Loop (actual time=3D6.185..77.085 rows=3D98000 loops=3D1) -> Seq Scan on public.tenk2 (actual time=3D0.182..3.709 rows=3D10= loops=3D1) Output: tenk2.unique1... Filter: (tenk2.thousand =3D 0) Rows Removed by Filter: 9990 -> Gather (actual time=3D5.265..6.355 rows=3D9800 loops=3D10) Workers Planned: 2 Workers Launched: 2 -> Parallel Bitmap Heap Scan on public.tenk1 (actual time=3D0.951..1.863 rows=3D4900 loops=3D20) Recheck Cond: (tenk1.hundred > 1) Worker 0: actual time=3D0.794..1.705 rows=3D4909 loop= s=3D10 Heap Blocks: exact=3D168 lossy=3D0 Worker 1: actual time=3D1.108..2.021 rows=3D4891 loop= s=3D10 Heap Blocks: exact=3D177 lossy=3D0 -> Bitmap Index Scan on tenk1_hundred (actual time=3D1.024..1.024 rows=3D9800 loops=3D10) Index Cond: (tenk1.hundred > 1) Worker 1: actual time=3D1.024..1.024 rows=3D9800 loops=3D10 - Melanie