From: Xuneng Zhou
Date: Fri, 9 Jan 2026 00:29:01 +0800
Subject: Re: Implement waiting for wal lsn replay: reloaded
To: Alexander Korotkov
Cc: Andres Freund, Thomas Munro, Álvaro Herrera, Chao Li, pgsql-hackers, Michael Paquier, jian he, Tomas Vondra, Yura Sokolov
References: <202601011659.ikh4ku4p3ovb@alvherre.pgsql>

Hi,

On Thu, Jan 8, 2026 at 10:19 PM Alexander Korotkov wrote:
>
> On Wed, Jan 7, 2026 at 6:08 AM Xuneng Zhou wrote:
> > On Wed, Jan 7, 2026 at 8:32 AM Andres Freund wrote:
> > > On 2026-01-06 18:42:59 +1300, Thomas Munro wrote:
> > > > Could this be causing the recent flapping failures on CI/macOS in
> > > > recovery/031_recovery_conflict? I didn't have time to dig personally
> > > > but f30848cb looks relevant:
> > > >
> > > > Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
> > > > error running SQL: 'psql::1: ERROR: canceling statement due to
> > > > conflict with recovery
> > > > DETAIL: User was or might have been using tablespace that must be dropped.'
> > > > while running 'psql --no-psqlrc --no-align --tuples-only --quiet
> > > > --dbname port=25195
> > > > host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
> > > > dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
> > > > FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
> > > > no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
> > > > line 2300.
> > > >
> > > > https://cirrus-ci.com/task/5771274900733952
> > > >
> > > > The master branch in time-descending order, macOS tasks only:
> > > >
> > > >      task_id      | substring |  status
> > > > ------------------+-----------+-----------
> > > >  6460882231754752 | c970bdc0  | FAILED
> > > >  5771274900733952 | 6ca8506e  | FAILED
> > > >  6217757068361728 | 63ed3bc7  | FAILED
> > > >  5980650261446656 | ae283736  | FAILED
> > > >  6585898394976256 | 5f13999a  | COMPLETED
> > > >  4527474786172928 | 7f9acc9b  | COMPLETED
> > > >  4826100842364928 | e8d4e94a  | COMPLETED
> > > >  4540563027918848 | b9ee5f2d  | FAILED
> > > >  6358528648019968 | c5af141c  | FAILED
> > > >  5998005284765696 | e212a0f8  | COMPLETED
> > > >  6488580526178304 | b85d5dc0  | FAILED
> > > >  5034091344560128 | 7dc95cc3  | ABORTED
> > > >  5688692477526016 | bb048e31  | COMPLETED
> > > >  5481187977723904 | d351063e  | COMPLETED
> > > >  5101831568752640 | f30848cb  | COMPLETED <-- the change
> > > >  6395317408497664 | 3f33b63d  | COMPLETED
> > > >  6741325208354816 | 877ae5db  | COMPLETED
> > > >  4594007789010944 | de746e0d  | COMPLETED
> > > >  6497208998035456 | 461b8cc9  | COMPLETED
> > > >
> > > The failure rates of this are very high - the majority of the CI runs on the
> > > postgres/postgres repos failed since the change went in. Which then also means
> > > cfbot has a very high spurious failure rate. I think we need to revert this
> > > change until the problem has been verified as fixed.
> >
> > This specific failure can be reproduced with this patch v1.
> >
> > I guess the potential race condition is: when
> > wait_for_replay_catchup() runs WAIT FOR LSN on the standby, if a
> > tablespace conflict fires during that wait, the WAIT FOR LSN session
> > is killed even though it doesn't use the tablespace.
> >
> > In my test, the failure won't occur after applying the v2 patch.
>
> I see, you were right. This is not related to the MyProc->xmin.
> ResolveRecoveryConflictWithTablespace() calls
> GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid). That
> would kill WAIT FOR LSN query independently on its xmin.

I think the concern is valid: conflicts like
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT could also occur and terminate the
backend if the timing is unlucky, though they are harder to reproduce.
Checking the log for "conflict with recovery" would likely catch these
conflicts as well.

> I guess your
> patch is the only way to go. It's clumsy to wrap WAIT FOR LSN call
> with retry loop, but it would still consume less resources than
> polling.
>

Recovery conflicts should be relatively rare in TAP tests, apart from
tests explicitly designed to provoke them, such as
031_recovery_conflict. They also only matter in the narrow window where
the standby has not yet caught up when WAIT FOR LSN is invoked. Given
that, a simple fallback seems appropriate to me.

-- 
Best,
Xuneng
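
P.S. To make the fallback idea concrete, here is a rough, language-neutral sketch in Python. It is not the v2 patch and not the Perl code in Cluster.pm. The callables run_wait_for_lsn and replay_reached, and the default poll interval and timeout, are hypothetical names and values chosen only for illustration. The one thing taken from the thread is the "conflict with recovery" check: issue WAIT FOR LSN once, and if the session is cancelled by a recovery conflict, fall back to plain polling.

```python
import time
from typing import Callable, Tuple

# Error text seen in the CI failure quoted in this thread.
CONFLICT_MARKER = "conflict with recovery"


def wait_for_replay(
    run_wait_for_lsn: Callable[[], Tuple[bool, str]],
    replay_reached: Callable[[], bool],
    poll_interval: float = 0.1,
    timeout: float = 180.0,
) -> bool:
    """Wait for replay via WAIT FOR LSN, falling back to polling on conflict.

    run_wait_for_lsn (hypothetical) issues WAIT FOR LSN and returns
    (succeeded, error_output). replay_reached (hypothetical) reports whether
    the standby's replay LSN has passed the target.
    """
    ok, err = run_wait_for_lsn()
    if ok:
        return True
    if CONFLICT_MARKER not in err:
        # Not a recovery conflict: surface the error instead of masking it.
        raise RuntimeError(f"WAIT FOR LSN failed: {err}")

    # The waiting session was cancelled by a recovery conflict, so poll.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if replay_reached():
            return True
        time.sleep(poll_interval)
    return False
```

Only the conflict case takes the polling path, so the common case still gets the cheaper WAIT FOR LSN wait, and any other error is reported rather than silently retried.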