Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wIav4-008F4z-0W for pgpool-hackers@arkaria.postgresql.org; Thu, 30 Apr 2026 23:35:46 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wIav3-009Ix0-14 for pgpool-hackers@arkaria.postgresql.org; Thu, 30 Apr 2026 23:35:45 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wIav3-009Iwi-0U for pgpool-hackers@lists.postgresql.org; Thu, 30 Apr 2026 23:35:45 +0000 Received: from meldrar.postgresql.org ([2a02:c0:301:0:ffff::31]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1wIav0-00000003bAO-0hZ1 for pgpool-hackers@lists.postgresql.org; Thu, 30 Apr 2026 23:35:44 +0000 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=postgresql.org; s=20171124; h=Content-Transfer-Encoding:Content-Type: Mime-Version:References:In-Reply-To:From:Subject:To:Message-Id:Date:Sender: Reply-To:Cc:Content-ID:Content-Description; bh=f6R5mU5/m9YatQz6msa+sxC4+2bJ0kshTwjqauMr+Jw=; b=KABvfC2Ki6fu8VYiZ3k/b63UxA GvrTMNaaw+f33LoGX2ikxecAq9XZqFnNpCy2MLtKXj+RceC69brEoIuPfXRsA9CqKvIUZA3gn900j 1uT1fx+Q99n9QwGzdZ5A6RCB3IdK1expaoAka8qP1xh+Cj2+R88Z+SIXOdrnnjtr2oBIV7extlMDv ax83WP44qExNTJTPl64hSLTQCX1UpLi7kSo2jV1WxwjjG2HFJjAQaPAKyAMBpJiIaWGHe3Sl0/zQ7 DyDM21hENuj1GHm/Fw8qe95V4r9X7U5OsEWejgh5EuIW0qB/TLcdODEoEgfAJakRQj99Cc7fp22lK 4djy8iFw==; Received: from [2409:11:4120:300:13ec:b688:ae16:bdde] (helo=localhost) by meldrar.postgresql.org with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wIaux-009eMj-0i for pgpool-hackers@lists.postgresql.org; Thu, 30 Apr 2026 23:35:41 +0000 Date: Fri, 01 May 2026 08:35:29 +0900 (JST) Message-Id: <20260501.083529.324263108638250692.ishii@postgresql.org> To: pgpool-hackers@lists.postgresql.org Subject: Re: 120.memory_leak_extended_memqcache fails on master From: Tatsuo Ishii In-Reply-To: <20260430.191509.901412343921235299.ishii@postgresql.org> References: <20260430.191509.901412343921235299.ishii@postgresql.org> X-Mailer: Mew version 6.8 on Emacs 29.3 Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Host-Lookup-Failed: Reverse DNS lookup failed for 2409:11:4120:300:13ec:b688:ae16:bdde (failed) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk > Summary: > > - Recent 120.memory_leak_extended_memqcache regression test failures > on master branch do not mean we have memory leaks with query cache > and/or extended query processing. It's false positive because the > test script is not prepared for child process going away by FATAL > error. > > - The FATAL error is caused by an oversight in commit a350afd70 > "Feature: send sync message only to necessary backends.". > > - Proposed fix attached. > > Recently 120.memory_leak_extended_memqcache regression test > occasionally fails on master. So I dug into the buildfarm log and > found following error. Note that in this case the load balance node is > 1. If it is 0, the test succeeds. > > The test runs pgbench and the log below emitted at the end of > bench marking. When pgbench finishes, pgpool executes queries > registered in reset_query_list. "DISCARD ALL" comes from it. > > 2026-04-29 01:00:58.287: pgbench pid 62605: LOG: DB node id: 1 backend pid: 62625 statement: Sync > 2026-04-29 01:00:58.287: pgbench pid 62605: LOG: DB node id: 0 backend pid: 62624 statement: DISCARD ALL > 2026-04-29 01:00:58.287: pgbench pid 62605: LOG: DB node id: 1 backend pid: 62625 statement: DISCARD ALL > 2026-04-29 01:00:58.288: pgbench pid 62605: LOG: pool_send_and_wait: Error or notice message from backend: DB node id: 0 backend pid: 62624 statement: "DISCARD ALL" message: "DISCARD ALL cannot run inside a transaction block" > 2026-04-29 01:00:58.288: pgbench pid 62605: WARNING: packet kind of backend 1 ['C'] does not match with main/majority nodes packet kind ['E'] > 2026-04-29 01:00:58.288: pgbench pid 62605: FATAL: failed to read kind from backend > 2026-04-29 01:00:58.288: pgbench pid 62605: DETAIL: kind mismatch among backends. Possible last query was: "DISCARD ALL" kind details are:0[E: DISCARD ALL cannot run inside a transaction block] 1[C] > 2026-04-29 01:00:58.288: pgbench pid 62605: HINT: check data consistency among db nodes > > The FATAL error terminated child process 62605, caues the test > failure. From 129 test's test.sh: > > # run pgbench for a while > $PGBENCH -M extended -S -T 30 test > > after_size=`ps l $pid|tail -1|awk '{print $7}'` > delta=`expr $after_size - $init_size` > > echo "initial process size: $init_size after size: $after_size delta: $delta" > > test $delta -eq 0 > > if [ $? != 0 ];then > echo "memory leak in $delta KB in mode:$mode" > ./shutdownall > exit 1 > fi > > and from src/test/regression/log/120.memory_leak_extended_memqcache: > > expr: non-integer argument > initial process size: 156900 after size: VSZ delta: > ./test.sh: line 59: test: -eq: unary operator expected > memory leak in KB in mode:s > > Since ps command does not see the data of process 62605 any more, the > "test $delta -eq 0" command failed and it makes the script believe that > there's a memory leak. Probably we should fix test.sh here but I think > FATAL error should be fixed in the first place. Once we fix it, > there's no need to fix test.sh. So I think we can leave test.sh as it > is now. > > The reason why DISCARD ALL failed on node 0 PostgreSQL is, DISCARD ALL > executes inside a transaction block according to the log. However the > queries executed by pgbench do not use an explicit transactions. I was > puzzled. So I look into PostgreSQL log on node 0. Grepping related > process (62624) from the PostgreSQL log shows this: > > 62624 2026-04-29 01:00:28.288 UTC LOG: execute pgpool62605/pgpool62605: SELECT count(*) FROM pg_catalog.pg_class AS c, pg_catalog.pg_namespace AS n WHERE c.relname = 'pgbench_accounts' AND c.relnamespace = n.oid AND n.nspname ~ '^pg_temp_' > 62624 2026-04-29 01:00:29.068 UTC LOG: execute pgpool62605/pgpool62605: SELECT COALESCE(pg_catalog.to_regclass('"pgbench_accounts"')::oid, 0) > 62624 2026-04-29 01:00:58.287 UTC LOG: statement: DISCARD ALL > 62624 2026-04-29 01:00:58.287 UTC ERROR: DISCARD ALL cannot run inside a transaction block > 62624 2026-04-29 01:00:58.287 UTC STATEMENT: DISCARD ALL > > Those two execute commands were issued by do_query, which is > responsible for getting information for the relation cache > module. When pgpool runs under extended query mode and outside an > explicit transaction, do_query does not issue a sync message because > it could break unnamed portal. So in process 62624 on node 0 continues > to be in an extended query mode without sending sync. In this case > PostgreSQL keeps on opening an implicit transaction until sync message > is sent. Why pgpool did not send sync to node 0? > > Commit a350afd70 allows to send sync messages only to necessary > backend nodes. Any queries comes from clients are tracked for this > purpose. However it forgot to the case when do_query generates > internal queries. I think we should track such that queries, so that a > sync message is issued to the server which do_query sent to, when the > client sends a sync message. > > For this purpose I add a new member "bool > pending_sync_map[MAX_NUM_BACKENDS]" to session_context, which a node > id is set to when do_query sends a extended query to backend. Note > that it is set only when do_query is called outside an explicit > transaction. In add_sync_pending_message, which is called when client > sends a sync message, we check the map and set sync_map so that > subsequent call to SimpleForwardToBackend will send sync message to > the backend node. > > Attached patch implements this. Pushed to master branch only, since the commit a350afd70 had not been back-patched to stable branches (it's a new feature). Note that we haven't seen failure with 119.memory_leak_extended, which is pretty similar test to 120. The difference is, 119 uses only one DB node, which masks the bug of the commit. Regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp