Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wB4vr-000iMV-1P for pgsql-hackers@arkaria.postgresql.org; Fri, 10 Apr 2026 06:01:31 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wB4vp-009VUC-1f for pgsql-hackers@arkaria.postgresql.org; Fri, 10 Apr 2026 06:01:30 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wB4vp-009VTy-0K for pgsql-hackers@lists.postgresql.org; Fri, 10 Apr 2026 06:01:30 +0000 Received: from mail-pf1-x42f.google.com ([2607:f8b0:4864:20::42f]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wB4vn-00000000HYM-3Rlx for pgsql-hackers@lists.postgresql.org; Fri, 10 Apr 2026 06:01:29 +0000 Received: by mail-pf1-x42f.google.com with SMTP id d2e1a72fcca58-824c9da9928so1531233b3a.3 for ; Thu, 09 Apr 2026 23:01:28 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1775800886; cv=none; d=google.com; s=arc-20240605; b=ksO46vhZUzFBigajv2sPAGCwnW2TbZrdcvYKD01HWvsUzCY7mtlmNjphbiYmkxePRV vZK5n8KjexEbqxD2xjqoUuZ1aPwuhVEuFAlno/lMe1LPfvH6aaTa5Bpbjtv8scAD6d/v QpKJalS4GWz+KY+QZyIOLw05WMzfnPuPEuIvirnd07Xx389BKtj+rmoVjg02aGXCMQHi unImXkq1O/k4tFf5QZqGs2Ucyi1kqYXlrUWOh+xFKI++ZoafV2GLOrvIq7bWgLEWpI3t d1d4uNL8MAe8IEGGwPtystMxsLHH1w6Dk3Hb6YctXGh3bKGcEh2KCt6X/8ImMLh8CGuk 347A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=WanuYdhvwQCFpE70Fc2DNW+G9CYcdeHRuqsNFzJ/EU0=; fh=HkGD1jcdGy9807N8sFBBVCc7hiEmhhIYpWB+NVu1Esg=; b=ctc4vV3DJLA4yZoPBzaiMeuBPz6dTcFkdTOjenVAz3Zz5u6yJAh93u1JlgfKk5UEDU k+M911X8zdQK/yUQ716aXbSsYoR3Aj0zxHLV4I5ACS/UcBrlwAkfGWyNMbTpz4h2vXTB v5Pzf5V8qWg6ViSc/Y3rp90vk6VhKpn+ULgBbw/UY3Ri9A5v9/vPi1rpt3X9j+zCp1+S q5Jf66+Nil68jQFUfCuZel47N8gjmVFyJYou3TmBNqqM0V66kMHPpZo5ijdogrI4STMX 2cOtSYUDog34gPm1oOdh+MF/gzFFy9UX0dGNUB7+Yojik4pG5YkY8j5kkwrOjs2NVQEu Tfcg==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1775800886; x=1776405686; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=WanuYdhvwQCFpE70Fc2DNW+G9CYcdeHRuqsNFzJ/EU0=; b=riZfDh5EE3rzfZrdfe8oTXovbnilntD2nrn5Lrj5vpoK6sCeFhDTGnGoNY02FKdSTH U0ALlqYQtsT54IXE8zlagNpK9C2aBFLHWe1gJK76oIlqrf9VABM1sCj/70vK5nGXUmG9 FcnMCAJQX8rHESe5ebRmZ4XR/3ZKlCD3wVAfmApxTLFouzxo0R8DuPmqKTS+T3m9lQzu cTHInaseU7bFXSu8WmODBRucM++2e+85RZlz1B4XJhb7s+Yve2TQIauGy+32oNwz4l97 95p/mJZGs3MZ9N3m9uQFbQEsdFPN0swiEQN85P80Z6mn0HiKX2PfJxNWHwSoIGhAEea/ dGaA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1775800886; x=1776405686; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=WanuYdhvwQCFpE70Fc2DNW+G9CYcdeHRuqsNFzJ/EU0=; b=c0EZoI2O/rcJOP7XMDJ/puRts8BGcZl9NFkZs1zK5aLRnii3ODxmOcqe6jA9i4jCHf gfoOgFxFl1eiRNBZTWVGi0oE4eckvKN7PqcAP+jVzFLpAdpc4iM7o2KNKpQpD2xM9yoM cmNv3WsmGKSX0GxGVsZ3GhgqTe7mJrFse99vc1pTawhPo3L9q2HgD9KG6PzbKFn7wvoK 99x2VoaekYM8Ac4cOZtP/rP0lnCn/syl28I2Wy/ER0TZ6cVW3cCbunqKaKYVsszeOxPD /WlAH9XoDoRUwTNlonmj/gXulr4mbRcXcKH0ztgegAvB9iDZ56dqSRawtFrRwnap2xAT dKPQ== X-Forwarded-Encrypted: i=1; AJvYcCWdbl0AwgmWN/B3bOb/248Nr5MoimEPB+dwJIuwPyl2QYX/TKqfPgmam7Y0mHoB3jcRHGv1pWLg4fcMCn5q@lists.postgresql.org X-Gm-Message-State: AOJu0YxD1zqhRle1FNh9YTAkBU7nx1yHqpLALTeyay0JyJCuBUxfIQZk yPxA9AS2oyCyCHXbp8OSxCm7HwUU/7U0YnoCOGo6i44mH3ybYY+l9N8lmrY0dbxKpujtTtXVpvv O3H1pbkbc73MrU4kl+840mCRWe3PsbcM= X-Gm-Gg: AeBDievWAu/uhJ9gMzj3VLsNQalDPTmlmWdiZ2f6lC9ebXf6kPXh5KBsalLW0sh+qOf N6vT14bxffmZ49Bsze4lbpX7BKyow9DPsWxoRUPkdkiJH7dbMU9vcv418VY79vQvGWlJveaE/BC 1ZoALiN/HyVcRcgkWX/941lOv47su5xoRHVt9VZJYOOx/uvcIh/Z4ZVd4q/+Hd0RrdeSiFW+YqU Xopcx0v2p0UK9ksnYSZ8NwSJlpOSgj3gYTJKWn1kYVdmLUInVffeIXJyGsrgk7z6VYAUeQ6cadQ kTQS47YcHqS7J/7r22tHmLde5kAKnH8qZzOYB2449jGc5PEd+kmexYu2g05umBU= X-Received: by 2002:a05:6a20:4307:b0:39c:14b3:6234 with SMTP id adf61e73a8af0-39fe3cf22a2mr2246059637.15.1775800886458; Thu, 09 Apr 2026 23:01:26 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: shveta malik Date: Fri, 10 Apr 2026 11:31:14 +0530 X-Gm-Features: AQROBzAzIRhVcbUUSka992lwVTcTHsTPsj0X2pg7lsfU_x7Unf5uiTPPXhEOmwQ Message-ID: Subject: Re: Fix slotsync worker busy loop causing repeated log messages To: "Zhijie Hou (Fujitsu)" Cc: Amit Kapila , PostgreSQL Hackers , Fujii Masao , shveta malik Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Fri, Apr 10, 2026 at 8:28=E2=80=AFAM Zhijie Hou (Fujitsu) wrote: > > On Sunday, March 22, 2026 1:47 AM Amit Kapila w= rote: > > On Thu, Mar 19, 2026 at 7:08=E2=80=AFPM Fujii Masao > > wrote: > > > > > > On Sat, Feb 28, 2026 at 2:03=E2=80=AFPM Amit Kapila > > wrote: > > > > > > > > On Fri, Feb 27, 2026 at 8:34=E2=80=AFPM Fujii Masao > > wrote: > > > > > > > > > > Normally, the slotsync worker updates the standby slot using the > > > > > primary's slot state. However, when confirmed_flush_lsn matches > > > > > but restart_lsn does not, the worker does not actually update the > > > > > standby slot. Despite that, the current code of > > > > > update_local_synced_slot() appears to treat this situation as if > > > > > an update occurred. As a result, the worker sleeps only for the > > > > > minimum interval (200 ms) before retrying. In the next cycle, it > > > > > again assumes an update happened, and continues looping with the > > > > > short sleep interval, causing the repeated logical decoding log m= essages. > > Based on a quick analysis, this seems to be the root cause. > > > > > > > > > > I think update_local_synced_slot() should return false (i.e., no > > > > > update > > > > > happened) when confirmed_flush_lsn is equal but restart_lsn > > > > > differs between primary and standby. > > > > > > > > > > > > > We expect that in such a case update_local_synced_slot() should > > > > advance local_slot's 'restart_lsn' via > > > > LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in th= e > > > > cheap code path next time. Normally, restart_lsn advancement should > > > > happen when we process XLOG_RUNNING_XACTS and call > > > > SnapBuildProcessRunningXacts(). In this particular case as both > > > > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the > > > > machinery may not be processing XLOG_RUNNING_XACTS record. I have > > > > not debugged the exact case yet but you can try by emitting some > > > > more records on publisher, it should let the standby advance the > > > > slot. It is possible that we can do something like you are proposin= g > > > > to silence the LOG messages but we should know what is going on her= e. > > > > > > I agree it's important to investigate why restart_lsn doesn't advance > > > in that case and fix that issue. > > > > > > Separately, I think mistreating the slot as updated is another proble= m: > > > the local slot is treated as updated even when it isn't, which causes > > > the slotsync worker to use the minimum sleep interval. We should > > > address this independently. > > > > > > LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local > > > slot, but update_local_synced_slot() currently assumes it does. That > > > seems to be the root cause. The attached 0001 patch makes > > > LogicalSlotAdvanceAndCheckSnapState() return whether the slot was > > > actually updated, and updates update_local_synced_slot() to act based= on > > that result. > > > > > > I've also attached 0002, which reduces the log level of some logical > > > decoding messages (e.g., "logical decoding found consistent point") > > > from LOG to DEBUG1. These are low-level, developer-oriented messages > > > and can be noisy when they occur regularly. This is the same issue I > > reported earlier at [1]. > > > > > > How about applying these patches while we continue working on the > > > restart_lsn issue? > > > > > > > * > > @@ -2174,7 +2193,10 @@ > > LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto, > > > > if (XLogRecPtrIsValid(ctx->reader->EndRecPtr)) > > { > > - LogicalConfirmReceivedLocation(moveto); > > + bool slot_updated =3D LogicalConfirmReceivedLocation(moveto); > > + > > + if (updated_xmin_or_lsn) > > + *updated_xmin_or_lsn =3D slot_updated; > > > > BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot whil= e > > processing running_xact record, so not sure we can rely only on the exp= licit > > call LogicalConfirmReceivedLocation() above to ascertain the same. > > Perhaps we could simply compare the slot's old and new LSN/xmin to determ= ine > whether updated_xmin_or_lsn needs to be set. This would avoid the need to= hook > into each individual update path at the lower level. Attaching a patch fo= r reference. > > > > > Sorry, I couldn't get the chance to look at the patches proposed by Hou= -san to > > fix this issue but I'll look at it after the feature freeze. > > The other patches attempt to advance restart_lsn in more cases by modifyi= ng the > logical decoding logic. However, those scenarios are relatively rare and = may not > produce significant improvement outside the slot synchronization context.= So, I > think modifying the slot sync worker to correctly increase the polling in= terval instead > is also a good starting point. > After reviewing the entire thread and patches, the above solution appears to be the simplest and least intrusive, especially since the issue is limited to the sltosync worker. So +1 for this idea. thanks Shveta