Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wPQbs-000j74-0U for pgsql-hackers@arkaria.postgresql.org; Tue, 19 May 2026 20:00:12 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wPQbn-004zL4-2U for pgsql-hackers@arkaria.postgresql.org; Tue, 19 May 2026 20:00:08 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wPQbn-004zKv-0w for pgsql-hackers@lists.postgresql.org; Tue, 19 May 2026 20:00:08 +0000 Received: from mail-wm1-x332.google.com ([2a00:1450:4864:20::332]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wPQbl-00000000NUu-128G for pgsql-hackers@lists.postgresql.org; Tue, 19 May 2026 20:00:07 +0000 Received: by mail-wm1-x332.google.com with SMTP id 5b1f17b1804b1-490229aa522so7919665e9.3 for ; Tue, 19 May 2026 13:00:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1779220803; x=1779825603; darn=lists.postgresql.org; h=in-reply-to:from:content-language:references:cc:to:subject :user-agent:mime-version:date:message-id:from:to:cc:subject:date :message-id:reply-to; bh=eNyWwHbSTyhXOUSChMnKEiHY9IXWscyUwzWvF8dbMSA=; b=hDzywQkSDwD61uFLN6PjqXezlYK95nuqywQt0InEyOVGq+oypQcsATUqaIIhkxQckv FI6Fsxpx+j1TFXW360DQky34rMzJqE6eC4BoPbmlY8TVYzwtEa1TpfN6kri/x/rssSGz 6GvWzIcRCmB7xEW4i3xqdNWiEvDgjhZtv6nQ1FNXO+N9D5G1wAb3YLKmZ62T76nHLBHa qKjU1GJXjODqDq//2PyGODbrCqG1+cEaJnffHrudN+ofaivjyQHrK8pKFWmvofgS27Sg VjotbJ99p5hZdyiML8c8MGTSlufeVD8WMpX1f21wUdzOsKcevBwxAlykKv3+H7qhZRF9 G6IQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1779220803; x=1779825603; h=in-reply-to:from:content-language:references:cc:to:subject :user-agent:mime-version:date:message-id:x-gm-gg:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=eNyWwHbSTyhXOUSChMnKEiHY9IXWscyUwzWvF8dbMSA=; b=hAUp3rLOpz7AYmX8pD1LIPwwlTUMdXP2i9F1+HlvP896O00GKmsoaDIQUczw3nAzF6 /nal6GskJP0AW1XkEKgb/oVwKS6qr+g/CInk9KaS40jNMVjikOOiTRoWfAB64k8hOWqp VslvDfOLmHfPFWQFZV5dRPahxUPkVYOO/G9VxiLBzVo3ITBLS1OZp6xcuEHVJXIeWdjk 4wKvOiTEwy7tXawm+tplAP6HyvCTXq0c4mYjs0yUWEsRE1RdHM2RI5x+6c1HG3sTyIXw YvWkitYYLwY8bYCkXyZvwW5TMQf1e8x59TqxX1bZDMhzxMrLxWa+x6dVDL5G1+R6mhB/ gYZg== X-Forwarded-Encrypted: i=1; AFNElJ+DhsgAR3bdZm9ud22xUFBDFgt4R1guPdgzQNEMVeCstYONTH4h/vWutJVOBfaWl2zalTqWXx4mz7rFn0QM@lists.postgresql.org X-Gm-Message-State: AOJu0YxnkAEQmMhxbtlLNmZYPbVPKtutPuHFLEbsAK9SreFxs9jUjJAO N4w1O4KO9/H8DQ3cUC/ynL/kknfcXk/Ht1MifTpKP5liXA17IbPS5jNY X-Gm-Gg: Acq92OGmTeh52+eq49P8tKf/QY6iXQSXM3tGLZWW4BGCeLS0dmF0sThpvpcB0VKbzcC lS+KNcUQ/k0ztFyda0FsKTgxTz1IpyrQ4Lgs+GljY89ofYTQdvVYXa0DH/d6Z58KQ/DWnCZwPIO xcWI70NLv21yyBWL+SnlD7DeJMd9w55WZgwAH6jhKq1ugCPiUMLSD/ajH102b4zrfUvwk/F2i/z nOlX5wdZ/d0HJCkU3Juy40wXg+cF1eAH1Q8k0NSQg6sSgkTERSwu2/Ieur3YaWT5Cbko4KmV6hB nGQaHif2JbJjeEEN34eyNJh8ADY3Db5pTyoqMrWmiiJh5GQHgVLWMi2l+VvPU1/pkQQMVSzzYPx t/UDCziqo/B9I03kHTW3TJvPWFksAOyX0UczmuW2hc9qtMkj2h0J5PSd7mtXJwsU0gXPGC/nlNd t28l/DFi4+34tBeybGrZoulo8N X-Received: by 2002:a05:600c:a30a:b0:48a:568f:ae8a with SMTP id 5b1f17b1804b1-48fe5fda35fmr251077815e9.8.1779220802872; Tue, 19 May 2026 13:00:02 -0700 (PDT) Received: from [192.168.0.50] ([89.149.68.143]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-48febe6b60csm168161645e9.6.2026.05.19.13.00.01 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 19 May 2026 13:00:02 -0700 (PDT) Content-Type: multipart/alternative; boundary="------------miSBLYsP7S8e0CU5MRgGS7IT" Message-ID: <63f6abc9-c0ae-465d-a4e6-667eca6ea008@gmail.com> Date: Tue, 19 May 2026 23:00:00 +0300 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Implement waiting for wal lsn replay: reloaded To: Alexander Korotkov , Xuneng Zhou Cc: Heikki Linnakangas , Peter Eisentraut , Andres Freund , Thomas Munro , =?UTF-8?Q?=C3=81lvaro_Herrera?= , Chao Li , pgsql-hackers , Michael Paquier , jian he , Tomas Vondra , Yura Sokolov References: Content-Language: en-US From: Alexander Lakhin In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------miSBLYsP7S8e0CU5MRgGS7IT Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Hello Alexander and Xuneng, 06.04.2026 22:49, Alexander Korotkov wrote: > Thank you, I've pushed your version of patchset. I made two minor > corrections for patch #2: mention default mode value in the header > comment, and fallback to polling on has_wal_read_bug sparc64+ext4 bug. I discovered a new test failure, that is apparently caused by new wait_for_catchup() implementation [1]: [06:20:23.110](1.069s) not ok 8 - check that the slot state changes to "extended" [06:20:23.110](0.001s) #   Failed test 'check that the slot state changes to "extended"' #   at /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 140. [06:20:23.111](0.000s) #          got: 'unreserved' #     expected: 'extended' [06:20:23.231](0.120s) not ok 9 - check that the slot state changes to "unreserved" [06:20:23.231](0.000s) #   Failed test 'check that the slot state changes to "unreserved"' #   at /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 152. [06:20:23.231](0.000s) #          got: 'lost|' #     expected: 'unreserved|t' I've managed to reproduce such failures with: diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 07eac07b9ce..493ce92674e 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -1143,2 +1143,3 @@ XLogWalRcvSendReply(bool force, bool requestReply, bool checkApply) +pg_usleep(10000);      /* Get current timestamp. */ diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 04aa770d981..19cda3a6b51 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2521,2 +2521,3 @@ ProcessStandbyReplyMessage(void) +pg_usleep(100000);      /* the caller already consumed the msgtype byte */ Concretely, a loop: for i in {1..100}; do echo "ITERATION $i"; PROVE_TESTS="t/019*" make -s check -C src/test/recovery/ || break; done failed for me on iterations 2, 1, 7: ITERATION 7 # +++ tap check in src/test/recovery +++ t/019_replslot_limit.pl .. 8/? #   Failed test 'check that the slot state changes to "extended"' #   at t/019_replslot_limit.pl line 140. #          got: 'unreserved' #     expected: 'extended' t/019_replslot_limit.pl .. 26/? # Looks like you failed 1 test of 26. t/019_replslot_limit.pl .. Dubious, test returned 1 (wstat 256, 0x100) Failed 1/26 subtests With "WAIT FOR LSN" in wait_for_catchup() disabled, 100 iterations passed. Having extra logging added, I could see the key difference. Failed run: 2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.pl LOG:  !!!GetWALAvailability| targetLSN: 0/016C0000, targetSeg: 22, oldestSlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 22 2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.pl STATEMENT:  SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1' vs Successful run: 2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.pl LOG:  !!!GetWALAvailability| targetLSN: 0/01700000, targetSeg: 23, oldestSlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 23 2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.pl STATEMENT:  SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1' That is, with WAIT FOR LSN, primary in this test may advance slot->data.restart_lsn to the expected position after wait_for_catchup() returns. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=goldfish&dt=2026-05-13%2006%3A15%3A03 Best regards, Alexander --------------miSBLYsP7S8e0CU5MRgGS7IT Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 8bit
Hello Alexander and Xuneng,

06.04.2026 22:49, Alexander Korotkov wrote:
Thank you, I've pushed your version of patchset.  I made two minor
corrections for patch #2: mention default mode value in the header
comment, and fallback to polling on has_wal_read_bug sparc64+ext4 bug.

I discovered a new test failure, that is apparently caused by new
wait_for_catchup() implementation [1]:
[06:20:23.110](1.069s) not ok 8 - check that the slot state changes to "extended"
[06:20:23.110](0.001s) #   Failed test 'check that the slot state changes to "extended"'
#   at /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 140.
[06:20:23.111](0.000s) #          got: 'unreserved'
#     expected: 'extended'
[06:20:23.231](0.120s) not ok 9 - check that the slot state changes to "unreserved"
[06:20:23.231](0.000s) #   Failed test 'check that the slot state changes to "unreserved"'
#   at /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 152.
[06:20:23.231](0.000s) #          got: 'lost|'
#     expected: 'unreserved|t'

I've managed to reproduce such failures with:
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 07eac07b9ce..493ce92674e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1143,2 +1143,3 @@ XLogWalRcvSendReply(bool force, bool requestReply, bool checkApply)
 
+pg_usleep(10000);
     /* Get current timestamp. */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 04aa770d981..19cda3a6b51 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2521,2 +2521,3 @@ ProcessStandbyReplyMessage(void)
 
+pg_usleep(100000);
     /* the caller already consumed the msgtype byte */

Concretely, a loop:
for i in {1..100}; do echo "ITERATION $i"; PROVE_TESTS="t/019*" make -s check -C src/test/recovery/ || break; done
failed for me on iterations 2, 1, 7:
ITERATION 7
# +++ tap check in src/test/recovery +++
t/019_replslot_limit.pl .. 8/?
#   Failed test 'check that the slot state changes to "extended"'
#   at t/019_replslot_limit.pl line 140.
#          got: 'unreserved'
#     expected: 'extended'
t/019_replslot_limit.pl .. 26/? # Looks like you failed 1 test of 26.
t/019_replslot_limit.pl .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/26 subtests

With "WAIT FOR LSN" in wait_for_catchup() disabled, 100 iterations
passed.

Having extra logging added, I could see the key difference.
Failed run:
2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.pl LOG:  !!!GetWALAvailability| targetLSN: 0/016C0000, targetSeg: 22, oldestSlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 22
2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.pl STATEMENT:  SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1'
vs
Successful run:
2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.pl LOG:  !!!GetWALAvailability| targetLSN: 0/01700000, targetSeg: 23, oldestSlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 23
2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.pl STATEMENT:  SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1'

That is, with WAIT FOR LSN, primary in this test may advance
slot->data.restart_lsn to the expected position after wait_for_catchup()
returns.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=goldfish&dt=2026-05-13%2006%3A15%3A03

Best regards,
Alexander
--------------miSBLYsP7S8e0CU5MRgGS7IT--