Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wPXdd-000oYE-1E for pgsql-hackers@arkaria.postgresql.org; Wed, 20 May 2026 03:30:29 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wPXda-0063LF-0L for pgsql-hackers@arkaria.postgresql.org; Wed, 20 May 2026 03:30:27 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wPXdZ-0063Kk-1y for pgsql-hackers@lists.postgresql.org; Wed, 20 May 2026 03:30:26 +0000 Received: from mail-ed1-x536.google.com ([2a00:1450:4864:20::536]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wPXdX-00000000QT6-23cj for pgsql-hackers@lists.postgresql.org; Wed, 20 May 2026 03:30:25 +0000 Received: by mail-ed1-x536.google.com with SMTP id 4fb4d7f45d1cf-68707d88626so1435852a12.0 for ; Tue, 19 May 2026 20:30:24 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1779247822; cv=none; d=google.com; s=arc-20240605; b=UTLca4YfSsBady9ivgRJKd4YNj8XC4dFlG86tCQGVXtlRLWQuHkXCwjeo6kH+biwJg PCWPk9JuYLscPB8PuTZmojNufxoi7OK3cnR0RphZ0MA3xLZaaUzYLxY60vazCaE6Gwvm jzt9t7XZMkiyupFfXJLYuIUCQutw7Cqyiz9fmzK647goVd/rS076TgcxkdQVkN60F5mc Tty3xVt8d2as9EzHQxHlK0JarTWC3uoZe88RMUw6+P7IDQWoOacyR3y8ISV0hCGzKwtD 1PPT/iBrykyuETagUmWLPtTiDh0vZykBythnyEXeDKU6YkL8N8B++UqBk3BMrLm4udtY Xirg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=ceqt1cE8N3zg3s7sHC7cplp7lHcEIvR1N4ZWBObEknY=; fh=yCOn4e6F0gnrpR3k8KEJNwLUQsL8UKGIgrKoAihwehA=; b=MUzcEUG6ITRRuas8H9T924Ro8uy9p1JPwbyYMMceboN7LoaDS/cHSyRS++6H2tihZX 2H1s511vGeAOEEOYA+aAiItTerOktDE0BZkG+vYP0edugL98OGfGNIJWst37x4ah5n95 rVTBJGZesmJ4kpNxWVlxvmwEHgNqXYqWoOFMYvwB35wZTv4FVWfzOYZwoefR/yObqL88 L49/9u6XXQI1Pp5jnJrP/El54hzdftJHDtC5+w3HSNzFG5crwvmsrLdVdy5DKXcIHGCM K01QvCo0+j/k3Qt5zKDcR83VX/TzjPeIUaRdJg7oR5wY8f0DWq1Nrs5evgvk3VvzKHEi xuHA==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1779247822; x=1779852622; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=ceqt1cE8N3zg3s7sHC7cplp7lHcEIvR1N4ZWBObEknY=; b=orCxuzWUUAmZgGZ19+3CkproiOmqOcHaPXn9YkZNJBJMYRh8iMM1WMWjY8/XBL9/2t wvxar4I62ouMqpuOgLUsEk3CGBOrcW8+qqGA+rAKfMUs3j8v2nKVQ4L/t+jKrnRAQxBT 3tRP2LHTIrbIpyUXeMdAsqkdu5K90QZVdt/jYNe3DW+Qy6UJ9uXCd9YRl1SRTQsO1RAU ST9cpItZNHY9vYrMrhusNU/Ioa/VfMUsG9b3eqFzVeOYuEvrR+9K8W5g1gDRyKl30UKs 7FFN69VCn9jIMq8hMTYDotu4Zs7IPWfmZC6UjS5+1aGkmCQpzsP2+WuNNfeLr0peSLjd ixqg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1779247822; x=1779852622; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=ceqt1cE8N3zg3s7sHC7cplp7lHcEIvR1N4ZWBObEknY=; b=r/ieMbcWPqegfjXiRGRWSceMQn0mImEquCJ62AQhBgziMA/FlNL5CvI5aQRC0kJGng FG55neMwk1Q6wXTh399gjgpqlidEqzhJSCuSlIyFqI4zh8fmRT70orK+aE6yQeYvsP4m 9DZPtHD8OGPjT2WsTcRGchN2x4ZvCMFWd+hOYam9mLmcru24o3Grex3lgsNKsjZxaCji UuIaH5PeAA4aMO+xY5iV/dOKfiPxfnzenklE/FMWwoMXx4QMk4Vlj25Vk5Px65LAFu+Y NpF9IWI/WEXR14jV5MOicnqUtEgebbtijJPbluxt23SsIlqETEvBfrZnm1udZ8sD2Aol ZgtQ== X-Forwarded-Encrypted: i=1; AFNElJ9xCP9arbzTB7Si3XBwRdA/aPn2NzIRwVY1pcraZb7n+TKCqvxQMXTc9BSqpq4ldpoNOac9Q4A4mT4a2HyX@lists.postgresql.org X-Gm-Message-State: AOJu0YznvqcvTjT3Bdmyd3Nqv5r13Jm+uUAEysDsi0o0lhe01IYinR/l RG5IhKpKI3hZX+bQT0af57ZURSUqbulEFHOEb/EC92aaLhsn12p1lS+1Mkgl7CHV7pkLtBIYWPp BKTupKxm1q9DqGEB0dEo5/XFgOP0KtZs= X-Gm-Gg: Acq92OEob39cs1hZOKCZnhb+q/eygYPgXXxehhrzuCRxF3pkRrK6yXonmh+jdv9Inkm f4J8JYObkd2O1azT0sq3Srm9tGvd+4dVhjkB9WnPVD3wmfJ8FddGIJI5Ts+NfofVK3ZCqQDwbyl qZvR59hFz5Y7BozMiBV3du3YVtVplD9rBogD/nBIxp9cXHXdcPSV8YCo1JGsjl0Uk52IGFkSijV cVHFlONBiKVA1q8mQIPH92strWSBhs1bFgUnmb0W08Re139rfEOHUhmVYkzAIldFCyYU3dhCeI0 Y/OxzaWYMkw9/DM0mmEFXzCqVwaSOo4gGDEDh0tznys5nFk3KpwupGARQMW2IsvW5yRVGIZSexm FfFbsm71N X-Received: by 2002:a17:907:9348:b0:bd8:bc8e:a321 with SMTP id a640c23a62f3a-bd8bc9de69cmr592706166b.8.1779247822335; Tue, 19 May 2026 20:30:22 -0700 (PDT) MIME-Version: 1.0 References: <63f6abc9-c0ae-465d-a4e6-667eca6ea008@gmail.com> In-Reply-To: <63f6abc9-c0ae-465d-a4e6-667eca6ea008@gmail.com> From: Xuneng Zhou Date: Tue, 19 May 2026 20:30:10 -0700 X-Gm-Features: AVHnY4KKKbJW0YWcOfM9ihg-3MMjFV8EYQMNcVvQMkhL0OaacWpPwyzzcKpcR8A Message-ID: Subject: Re: Implement waiting for wal lsn replay: reloaded To: Alexander Lakhin Cc: Alexander Korotkov , Heikki Linnakangas , Peter Eisentraut , Andres Freund , Thomas Munro , =?UTF-8?Q?=C3=81lvaro_Herrera?= , Chao Li , pgsql-hackers , Michael Paquier , jian he , Tomas Vondra , Yura Sokolov Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On Tue, May 19, 2026 at 1:00=E2=80=AFPM Alexander Lakhin wrote: > > Hello Alexander and Xuneng, > > 06.04.2026 22:49, Alexander Korotkov wrote: > > Thank you, I've pushed your version of patchset. I made two minor > corrections for patch #2: mention default mode value in the header > comment, and fallback to polling on has_wal_read_bug sparc64+ext4 bug. > > > I discovered a new test failure, that is apparently caused by new > wait_for_catchup() implementation [1]: > [06:20:23.110](1.069s) not ok 8 - check that the slot state changes to "e= xtended" > [06:20:23.110](0.001s) # Failed test 'check that the slot state changes= to "extended"' > # at /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_rep= lslot_limit.pl line 140. > [06:20:23.111](0.000s) # got: 'unreserved' > # expected: 'extended' > [06:20:23.231](0.120s) not ok 9 - check that the slot state changes to "u= nreserved" > [06:20:23.231](0.000s) # Failed test 'check that the slot state changes= to "unreserved"' > # at /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_rep= lslot_limit.pl line 152. > [06:20:23.231](0.000s) # got: 'lost|' > # expected: 'unreserved|t' > > I've managed to reproduce such failures with: > diff --git a/src/backend/replication/walreceiver.c b/src/backend/replicat= ion/walreceiver.c > index 07eac07b9ce..493ce92674e 100644 > --- a/src/backend/replication/walreceiver.c > +++ b/src/backend/replication/walreceiver.c > @@ -1143,2 +1143,3 @@ XLogWalRcvSendReply(bool force, bool requestReply, = bool checkApply) > > +pg_usleep(10000); > /* Get current timestamp. */ > diff --git a/src/backend/replication/walsender.c b/src/backend/replicatio= n/walsender.c > index 04aa770d981..19cda3a6b51 100644 > --- a/src/backend/replication/walsender.c > +++ b/src/backend/replication/walsender.c > @@ -2521,2 +2521,3 @@ ProcessStandbyReplyMessage(void) > > +pg_usleep(100000); > /* the caller already consumed the msgtype byte */ > > Concretely, a loop: > for i in {1..100}; do echo "ITERATION $i"; PROVE_TESTS=3D"t/019*" make -s= check -C src/test/recovery/ || break; done > failed for me on iterations 2, 1, 7: > ITERATION 7 > # +++ tap check in src/test/recovery +++ > t/019_replslot_limit.pl .. 8/? > # Failed test 'check that the slot state changes to "extended"' > # at t/019_replslot_limit.pl line 140. > # got: 'unreserved' > # expected: 'extended' > t/019_replslot_limit.pl .. 26/? # Looks like you failed 1 test of 26. > t/019_replslot_limit.pl .. Dubious, test returned 1 (wstat 256, 0x100) > Failed 1/26 subtests > > With "WAIT FOR LSN" in wait_for_catchup() disabled, 100 iterations > passed. > > Having extra logging added, I could see the key difference. > Failed run: > 2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.p= l LOG: !!!GetWALAvailability| targetLSN: 0/016C0000, targetSeg: 22, oldest= SlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 22 > 2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.p= l STATEMENT: SELECT wal_status FROM pg_replication_slots WHERE slot_name = =3D 'rep1' > vs > Successful run: > 2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.p= l LOG: !!!GetWALAvailability| targetLSN: 0/01700000, targetSeg: 23, oldest= SlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 23 > 2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.p= l STATEMENT: SELECT wal_status FROM pg_replication_slots WHERE slot_name = =3D 'rep1' > > That is, with WAIT FOR LSN, primary in this test may advance > slot->data.restart_lsn to the expected position after wait_for_catchup() > returns. > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=3Dgoldfish&dt= =3D2026-05-13%2006%3A15%3A03 Thanks for reporting this issue. I think this is related to the semantic change made earlier: wait_for_catchup() now returns once the standby itself has reached the target LSN, rather than waiting until the primary observes that position via pg_stat_replication. As a result, the primary may not yet have processed the standby feedback needed to advance the slot's restart_lsn when wait_for_catchup() returns. Actually, I was aware of this semantic change and previously thought it might be harmless. But this failure appears to disprove that. I'll prepare a patch to fix this shortly. --=20 Regards, Xuneng Zhou HighGo Software Co., Ltd.