Subject: Re: IPC/MultixactCreation on the Standby server
From: Dmitry
To: pgsql-general@lists.postgresql.org
Date: Tue, 24 Jun 2025 12:19:49 +0300
Message-ID: <21b95075-788d-4dfd-84e9-395aa9843944@yandex.ru>
In-Reply-To: <173051750685430@mail.yandex.ru>

On 23.06.2025 16:33, Dmitry wrote:
Hi,

The problem is as follows.
A replication cluster includes a primary server and one hot-standby replica.
The workload on the primary server consists of many concurrent requests that generate multixact IDs, while the hot-standby replica serves read-only queries.

After some time, all requests on the hot standby get stuck and never finish.

The `pg_stat_activity` view on the replica reports that the processes are stuck waiting for IPC/MultixactCreation.
Neither `pg_cancel_backend` nor `pg_terminate_backend` can stop them; SIGQUIT is the only way to do so.

We tried:
* changing the `autovacuum_multixact_freeze_max_age` parameter,
* increasing `multixact_member_buffers` and `multixact_offset_buffers`,
* disabling `hot_standby_feedback`,
* switching the replica to synchronous and asynchronous mode,
* and much more.
But nothing helped.

We also ran the replica in recovery mode from the WAL archive, i.e. as a warm standby; the result is the same.

We also built from source, based on REL_17_5, with the default configure settings
    ./configure
    make
    make install
but the problem persisted.

Here is an example with a synthetic workload reproducing the problem.

Test system
===========

-   Architecture: x86_64
-   OS: Ubuntu 24.04.2 LTS (Noble Numbat)
-   Tested postgres version(s):
    -   latest 17 (17.5)
    -   latest 18 (18-beta1)

Steps to reproduce
==================

    postgres=# create table tbl (
        id int primary key,
        val int
    );
    postgres=# insert into tbl select i, 0 from generate_series(1,5) i;


The first and second scripts execute queries on the primary server
------------------------------------------------------------------

    pgbench --no-vacuum --report-per-command -M prepared -c 200 -j 200 -T 300 -P 1 --file=/dev/stdin <<'EOF'
    \set id random(1, 5)
    begin;
    select * from tbl where id = :id for key share;
    commit;
    EOF

    pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 -T 300 -P 1 --file=/dev/stdin <<'EOF'
    \set id random(1, 5)
    begin;
    update tbl set val = val+1 where id = :id;
    \sleep 10 ms
    commit;
    EOF


The following script is executed on the replica
-----------------------------------------------

    pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 -T 300 -P 1 --file=/dev/stdin <<'EOF'
    begin;
    select sum(val) from tbl;
    \sleep 10 ms
    select sum(val) from tbl;
    \sleep 10 ms
    commit;
    EOF

    pgbench (17.5 (Ubuntu 17.5-1.pgdg24.04+1))
    progress: 1.0 s, 2606.8 tps, lat 33.588 ms stddev 13.316, 0 failed
    progress: 2.0 s, 3315.0 tps, lat 30.174 ms stddev 5.933, 0 failed
    progress: 3.0 s, 3357.0 tps, lat 29.699 ms stddev 5.541, 0 failed
    progress: 4.0 s, 3350.0 tps, lat 29.911 ms stddev 5.311, 0 failed
    progress: 5.0 s, 3206.0 tps, lat 30.999 ms stddev 6.343, 0 failed
    progress: 6.0 s, 3264.0 tps, lat 30.828 ms stddev 6.389, 0 failed
    progress: 7.0 s, 3224.0 tps, lat 31.099 ms stddev 6.197, 0 failed
    progress: 8.0 s, 3168.0 tps, lat 31.486 ms stddev 6.940, 0 failed
    progress: 9.0 s, 3118.0 tps, lat 32.004 ms stddev 6.546, 0 failed
    progress: 10.0 s, 3017.0 tps, lat 33.183 ms stddev 7.971, 0 failed
    progress: 11.0 s, 3157.0 tps, lat 31.697 ms stddev 6.624, 0 failed
    progress: 12.0 s, 3180.0 tps, lat 31.415 ms stddev 6.310, 0 failed
    progress: 13.0 s, 3150.9 tps, lat 31.591 ms stddev 6.280, 0 failed
    progress: 14.0 s, 3329.0 tps, lat 30.189 ms stddev 5.792, 0 failed
    progress: 15.0 s, 3233.6 tps, lat 30.852 ms stddev 5.723, 0 failed
    progress: 16.0 s, 3185.4 tps, lat 31.378 ms stddev 6.383, 0 failed
    progress: 17.0 s, 3035.0 tps, lat 32.920 ms stddev 7.390, 0 failed
    progress: 18.0 s, 3173.0 tps, lat 31.547 ms stddev 6.390, 0 failed
    progress: 19.0 s, 3077.0 tps, lat 32.427 ms stddev 6.634, 0 failed
    progress: 20.0 s, 3266.1 tps, lat 30.740 ms stddev 5.842, 0 failed
    progress: 21.0 s, 2990.9 tps, lat 33.353 ms stddev 7.019, 0 failed
    progress: 22.0 s, 3048.1 tps, lat 32.933 ms stddev 6.951, 0 failed
    progress: 23.0 s, 3148.0 tps, lat 31.769 ms stddev 6.077, 0 failed
    progress: 24.0 s, 1523.2 tps, lat 30.029 ms stddev 5.093, 0 failed
    progress: 25.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 26.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 27.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 28.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 29.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 30.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 31.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 32.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 33.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 34.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 35.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed

After some time, all requests on the replica hang waiting for IPC/MultixactCreation.
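
The moment throughput collapses can be picked out of a saved pgbench log mechanically. A minimal sketch, assuming the `progress:` line format shown above; the function name `first_stall` is hypothetical and not part of pgbench:

```shell
# Print the timestamp (in seconds) of the first pgbench progress line
# that reports 0.0 tps. Prints nothing if throughput never drops to zero.
# Reads saved `pgbench -P 1` output from stdin.
first_stall() {
    # Fields: $1="progress:", $2=<seconds>, $3="s,", $4=<tps>, $5="tps,"
    awk '$1 == "progress:" && $5 == "tps," && $4 + 0 == 0 { print $2; exit }'
}
```

Run against the output above, saved to a hypothetical file `replica_pgbench.log`, `first_stall < replica_pgbench.log` prints `25.0`.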

Output from `pg_stat_activity`
------------------------------

            backend_type        | state  | wait_event_type |    wait_event     |                  query
    ----------------------------+--------+-----------------+-------------------+------------------------------------------
     client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
     client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
     client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
     client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
    ...
     client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
     client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
     client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
     startup                    |        | LWLock          | BufferContent     |
     checkpointer               |        | Activity        | CheckpointerMain  |
     background writer          |        | Activity        | BgwriterHibernate |
     walreceiver                |        | Activity        | WalReceiverMain   |
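
With a hundred or more backends, a tally per wait event is easier to read than the raw rows. A minimal sketch, assuming the psql aligned output with the five columns shown above has been saved to a file; the function name `tally_wait_events` is hypothetical:

```shell
# Count rows per wait_event_type/wait_event in psql aligned output read
# from stdin. The header, the separator line, and "..." lines are skipped.
tally_wait_events() {
    awk -F'|' '
        NF >= 5 && $1 !~ /backend_type/ {
            gsub(/^ +| +$/, "", $3)   # trim wait_event_type
            gsub(/^ +| +$/, "", $4)   # trim wait_event
            count[$3 "/" $4]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -rn
}
```

The most frequent wait event is printed first, so the stuck `client backend` group shows up on the top line.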


gdb session for `client backend` process
----------------------------------------

    (gdb) bt
    #0  0x00007f0e9872a007 in epoll_wait (epfd=5, events=0x57c4747fc458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
    #1  0x000057c440685033 in WaitEventSetWaitBlock (nevents=<optimized out>, occurred_events=0x7ffdaedc8360, cur_timeout=-1, set=0x57c4747fc3f0)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1577
    #2  WaitEventSetWait (set=0x57c4747fc3f0, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdaedc8360, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217765)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1525
    #3  0x000057c44068541c in WaitLatch (latch=<optimized out>, wakeEvents=<optimized out>, timeout=<optimized out>, wait_event_info=134217765)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:538
    #4  0x000057c44068d8c0 in ConditionVariableTimedSleep (cv=0x7f0cefc50ab0, timeout=-1, wait_event_info=134217765)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:163
    #5  0x000057c440365a0c in ConditionVariableSleep (wait_event_info=134217765, cv=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:98
    #6  GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
    #7  0x000057c4408adc6b in MultiXactIdGetUpdateXid.isra.0 (xmax=xmax@entry=45559845, t_infomask=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7478
    #8  0x000057c44031ecfa in HeapTupleGetUpdateXid (tuple=<error reading variable: Cannot access memory at address 0x0>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7519
    #9  HeapTupleSatisfiesMVCC (htup=<optimized out>, buffer=404, snapshot=0x57c474892ff0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1090
    #10 HeapTupleSatisfiesVisibility (htup=<optimized out>, snapshot=0x57c474892ff0, buffer=404) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1772
    #11 0x000057c44030c1cb in page_collect_tuples (check_serializable=<optimized out>, all_visible=<optimized out>, lines=<optimized out>, block=<optimized out>, buffer=<optimized out>, page=<optimized out>,
        snapshot=<optimized out>, scan=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:480
    #12 heap_prepare_pagescan (sscan=0x57c47495b970) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:579
    #13 0x000057c44030cb59 in heapgettup_pagemode (scan=scan@entry=0x57c47495b970, dir=<optimized out>, nkeys=<optimized out>, key=<optimized out>)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:999
    #14 0x000057c44030d1bd in heap_getnextslot (sscan=0x57c47495b970, direction=<optimized out>, slot=0x57c47494b278) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:1319
    #15 0x000057c4404f090a in table_scan_getnextslot (slot=0x57c47494b278, direction=ForwardScanDirection, sscan=<optimized out>)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/tableam.h:1072
    #16 SeqNext (node=0x57c47494b0e8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeSeqscan.c:80
    #17 0x000057c4404d5cfc in ExecProcNode (node=0x57c47494b0e8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
    #18 fetch_input_tuple (aggstate=aggstate@entry=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:561
    #19 0x000057c4404d848a in agg_retrieve_direct (aggstate=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2459
    #20 ExecAgg (pstate=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2179
    #21 0x000057c4404c2003 in ExecProcNode (node=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
    #22 ExecutePlan (dest=0x57c47483d548, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT, queryDesc=0x57c474895010)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:1649
    #23 standard_ExecutorRun (queryDesc=0x57c474895010, direction=<optimized out>, count=0, execute_once=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:361
    ...
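
The `wait_event_info=134217765` argument in frames #2 to #5 can be cross-checked against the `pg_stat_activity` output. A minimal sketch, assuming the encoding used by PostgreSQL 17, where the top byte of the 32-bit value is the wait class and the low 16 bits are the event ID within that class; the function name `decode_wait_event_info` is hypothetical:

```shell
# Split a wait_event_info value into wait class and event ID.
# Assumption: class = top 8 bits, event ID = low 16 bits.
decode_wait_event_info() {
    printf 'class=0x%02X id=%d\n' $(( ($1 >> 24) & 0xFF )) $(( $1 & 0xFFFF ))
}

decode_wait_event_info 134217765   # prints: class=0x08 id=37
```

Assuming `PG_WAIT_IPC` is defined as `0x08000000`, class `0x08` matches the `IPC` wait event type that `pg_stat_activity` reports for the stuck backends.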

gdb session for `startup` process
---------------------------------

    (gdb) bt
    #0  0x00007f0e98698ce3 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:57
    #1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:87
    #2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f0ceb34e6b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>)
        at ./nptl/futex-internal.c:139
    #3  0x00007f0e986a4f1f in do_futex_wait (sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
    #4  0x00007f0e986a4fb8 in __new_sem_wait_slow64 (sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
    #5  0x00007f0e986a503d in __new_sem_wait (sem=sem@entry=0x7f0ceb34e6b8) at ./nptl/sem_wait.c:42
    #6  0x000057c440696166 in PGSemaphoreLock (sema=0x7f0ceb34e6b8) at port/pg_sema.c:327
    #7  LWLockAcquire (lock=0x7f0cefc58064, mode=LW_EXCLUSIVE) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/lwlock.c:1289
    #8  0x000057c44038f96a in LockBuffer (mode=2, buffer=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/buffer/bufmgr.c:5147
    #9  XLogReadBufferForRedoExtended (record=<optimized out>, block_id=<optimized out>, mode=RBM_NORMAL, get_cleanup_lock=false, buf=0x7ffdaedc8b4c)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:429
    #10 0x000057c440319969 in XLogReadBufferForRedo (buf=0x7ffdaedc8b4c, block_id=0 '\000', record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:317
    #11 heap_xlog_lock_updated (record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10230
    #12 heap2_redo (record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10362
    #13 0x000057c44038e1d2 in ApplyWalRecord (replayTLI=<synthetic pointer>, record=0x7f0e983908e0, xlogreader=<optimized out>)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/xlog_internal.h:380
    #14 PerformWalRecovery () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogrecovery.c:1822
    #15 0x000057c44037bbf6 in StartupXLOG () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlog.c:5821
    #16 0x000057c4406155ed in StartupProcessMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/startup.c:258
    #17 0x000057c44060b376 in postmaster_child_launch (child_type=B_STARTUP, startup_data=0x0, startup_data_len=0, client_sock=0x0)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/launch_backend.c:277
    #18 0x000057c440614509 in postmaster_child_launch (client_sock=0x0, startup_data_len=0, startup_data=0x0, child_type=B_STARTUP)
        at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3934
    #19 StartChildProcess (type=type@entry=B_STARTUP) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3930
    #20 0x000057c44061480d in PostmasterStateMachine () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3392
    #21 0x000057c4408a3455 in process_pm_child_exit () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:2683
    #22 ServerLoop.isra.0 () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1667
    #23 0x000057c440616965 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1374
    #24 0x000057c4402bcd2d in main (argc=17, argv=0x57c4747fb140) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/main/main.c:199

Could you please help me fix the problem of the stuck `client backend` processes?

Any ideas or recommendations would be much appreciated!

Best regards,
Dmitry
 

A small addition to the problem description:
    - the problem does not reproduce on PostgreSQL 16

Best regards,
Dmitry
