Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wWVOo-002f7l-1d for pgsql-hackers@arkaria.postgresql.org; Mon, 08 Jun 2026 08:31:58 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wWVOm-001N5s-2l for pgsql-hackers@arkaria.postgresql.org; Mon, 08 Jun 2026 08:31:56 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wWVOm-001N5j-1O for pgsql-hackers@lists.postgresql.org; Mon, 08 Jun 2026 08:31:56 +0000 Received: from mail-ej1-x62d.google.com ([2a00:1450:4864:20::62d]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wWVOk-00000001erE-2BGG for pgsql-hackers@postgresql.org; Mon, 08 Jun 2026 08:31:55 +0000 Received: by mail-ej1-x62d.google.com with SMTP id a640c23a62f3a-bec405a6ea5so574831766b.0 for ; Mon, 08 Jun 2026 01:31:54 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1780907513; cv=none; d=google.com; s=arc-20240605; b=ZwMyRaUW4xzFyINmCbQHTEu1sDovApW7PKR+1svShHsusuoHMC+ZltzdKX1cKbbtW6 hKxQzXr87wJZl+HfFlDhgcZNp4a/12EITUhBEcUOjFMHWKk+NKxHkD949n2p2T5/44pE Lo19wQQDBnx1HSYO1l6KQkRY8RRcuDuaBIBDMF+3M/iIlvgDTCuXJqB3JAIqZ/ihcTnp TSTBINWBcAYtBYaCSUiFhbplY0XOzJSf7ViC3dlNjonZXaRy76E8QN50fIRest8wO7I+ /MlL6wmj14MB5eX2vmykVkvSn1ITMJw7rewEkqHYpJtES77OwhizTCm7uRpzweN8Uar0 sVHw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=/VHKMPI4wvhHWp8kkxBMNohaSTCayVzNcMVsA1MXvyg=; fh=F+DjkJ78IXyS2b0Ljlg77KbkjMocTAEOQEa9ucew/to=; b=drKhrzAwppn/7XYHJCzinbHW/tY+6pvbYlLYWVkz8sv7q8JlxilZbhH/Zkxnt1sgyR PCFayWPMtPu38HBYjR8NwfqgwtQIWG3C7I2ZZ7u+IHiKRhPdGUrDcYHJD14uLoJ+vBwT aKln7TUfY4TsUO0oRdbwgDEFSMSPmPxOqTG7rbostLFQJ7MNrE6cX8E8UMeBP0l9fNwP 8j9vVjoHmnTUP+txXnBItg7aUGSwmdH0kjA9XJ15DSUgg/BNUsU9Ir+xIH+8hLYgqcOS b2onFU4XtG24h3j8FXWmXNFZMRSlpejg6adAcmtH33SCQZfzdnbY1nhf2AWfwAMs9taA nOPg==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1780907513; x=1781512313; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=/VHKMPI4wvhHWp8kkxBMNohaSTCayVzNcMVsA1MXvyg=; b=nlg7VIs7+oACY6qCvOuElOfC97IVjtmF90zRbG7EaZajcggdPi9woTn/xuAXNLUDWZ ijiEiJbwgwMqvev3UAQ8QBHVMq3raSLEbzm3dKoTmLwisg/Scx1D/bya7JMJ59f/JJDj M9mV6yz2MYooFWxv4hsge6b+z9yIm+S10Out31vzwI61YjfU4/30rMrY/xfW7+tQRy+w tfRXQwtw+7E9rxB07iK04j/Kof/GV9LocbyDv4xhvMzDOqwUIl1iJ7M7yMwVQ9f912KA gXKzl3pSD9TIA/dePD18NmdxkB94mC09lX2JzZUttGg0cNeXyr9Oav/z1DU/olqzo+1+ kdrQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1780907513; x=1781512313; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=/VHKMPI4wvhHWp8kkxBMNohaSTCayVzNcMVsA1MXvyg=; b=tSsTIhOfiz7IKbrWkWQp5ud0mwbma0ZCIIG6F6pFK4gBtW+5Ln6AOUF/1ofM2Opu+8 0InudQ8i+KMowde3vdgPDfJ8e79NFyIPUUvJDayHDaPY+CoQlgwLQjQ5SZT00KSMdPGL XfLoV8WLmtHEQPZLZAEXvpnJwX2bZtpXcas+cr5mhsPDCghkD//MjYErfNjqutKfxi+t qTZ8WDvQbaFzM8e77oSd61PT/z+wihrE2gl/hlclL9FPGQXB92/VYEB4l9jhdshRdi2x nLE4qJ16EONkcgYkjiCBGuA+Mhzk9bg/lhwyw+mXeHvHLcImQ3sDL5/RqPdZaHIAAyxV d/ww== X-Forwarded-Encrypted: i=1; AFNElJ8kkjcF0kt2YY2WFgWdD6UFpT72nn8Zkpb6cLPdf0cIlDSFh8wpe6daiLuNSCP6bIwQiTzOoiIGGEZouFit@postgresql.org X-Gm-Message-State: AOJu0Ywwm6iVhJTBQTDrAmpoykeGlaWN7/8XMGJiyQ0f9s/1yyZfpT01 8w/HoTImTG6XRre8ALL0U/brpXPZ/vj3+suD5LG8krxFNbJlmMdWZ+7KSfNH2KDXMrsTTzsVvL6 bIFmvxfB0lXv+hyn2kHeGEc8WYgSsaao= X-Gm-Gg: Acq92OEQIdNXQ19jlORgx+T37caHIWunNdRrasEE27aXfRI4APeTxvDm6Cr5aq/Nsy/ fKoB6fGsbsxKTDAJQ6c1gC+XUhbkxvC22pRqo7ALBWIUcsxmX5/5dEX7w6Tw1Yqgcj68fegM2Mp WQ+AhYunz7ehkiX44PLKRD8hC5LTCG5ys5t/iAZW+HLDZ/998QfqAJ/9MwnQN4zddvqqNTo5heR FiFLmXDt49sJD5JCrfwmUMBVFOAomley5KdBnvcYz5pran6xL7FWlL7xmvfq8mf2q2HNe9E5ITN LYI5bnu6h9fAzouhh8Ivw/OlfkNGkvuXSDKncr4IaI7ZbTx+IDrasVoPH3gtMiwSghzh55m0KVU 1oENvDDTo9Jr2XpM1LDRZ6d3SNIE7hdX4X4bnvW7qFYqd6RR3dqvWUUWlDZMx X-Received: by 2002:a17:907:75f2:b0:bec:2ad0:cba5 with SMTP id a640c23a62f3a-bf372159cbemr394359566b.29.1780907512669; Mon, 08 Jun 2026 01:31:52 -0700 (PDT) MIME-Version: 1.0 References: <7daef094-abf3-4672-bc23-3df4763b16a3@gmail.com> In-Reply-To: From: Xuneng Zhou Date: Mon, 8 Jun 2026 16:31:39 +0800 X-Gm-Features: AVVi8CetPxcqqRGEYDNpLp-_bQmfjG8onM771i20lIFXWFzxEx98beJ18IATHFY Message-ID: Subject: Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline To: "Hayato Kuroda (Fujitsu)" Cc: Bertrand Drouvot , Alexander Lakhin , pgsql-hackers Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi Kuroda-san, On Mon, Jun 8, 2026 at 12:25=E2=80=AFPM Hayato Kuroda (Fujitsu) wrote: > > Hi Alexander, Bertrand, Xuneng, > > Thanks for seeing the failure. Our team also recognized but could not fin= d the reason. > > > Yeah, it looks like there is a race condition here. I think we should c= heck if > > the insertion timeline has already been set (like the walsummarizer is = doing). > > Sorry for stupid question; I tried to reproduce the failure but could not= , see attached. > > IIUC, the issue can happen if the walsender must read the WAL record gene= rated > after the promotion but the timeline could not be updated. I think the race is that the logical walsender is trying to read the pre-promotion wal from the slot's restart_lsn. 1) In 035_standby_logical_decoding.pl, the active slot / pg_recvlogical process is started before: INSERT rows 1..4 # before promotion promote standby INSERT rows 5..7 # after promotion 2) Alexander=E2=80=99s reproducer sleeps immediately after: ReplicationSlotAcquire(cmd->slotname, true, true); pg_usleep(200000); So with the delay: slot is active but walsender has not yet create decoding context and called xlogbeginread(); 3) Then the test continues and promotes the standby while the walsender still sleeping, therefore promotion could happen before the walsender starts its actual wal read. 4) The logical walsender is trying to read the pre-promotion wal from the slot's restart_lsn. The failing log implies this: Streaming transactions committing after 0/06487A20, reading WAL from 0/064879E8. This line is emitted from: errdetail("Streaming transactions committing after %X/%08X, reading WAL from %X/%08X.", slot->data.confirmed_flush, slot->data.restart_lsn) Function StartLogicalReplication() initiates the xlogreader at the restart = lsn: XLogBeginRead(logical_decoding_ctx->reader, MyReplicationSlot->data.restart_lsn); Because the slot was created and acquired before promotion, its restart_lsn is from the standby-era state which points to wal required for decoding the pre-promotion stream. In the test, the first expected output is rows 1..4, inserted before promotion. The failing output was empty: got: '' expected: BEGIN rows 1..4 COMMIT BEGIN rows 5..7 COMMIT Empty output suggests that the walsender died before emitting even the first pre-promotion transaction. 5) The walsender try to read wal but failed Failed run: LOG: !!!WalSndSegmentOpen| nextSegNo: 6, path: pg_wal/000000010000000000000006 ERROR: requested WAL segment 000000010000000000000006 has already been removed Passed run LOG: !!!WalSndSegmentOpen| nextSegNo: 6, path: pg_wal/000000020000000000000006 As the log suggested, the walsender of the successful run readed the wal from timeline 2 but failed to do so for timeline1. This comment in xlogutils.is almost exactly the issue: We care about timelines ... reading xlog generated prior to a promotion ... The server copied the segment to the new timeline ... there's no guarantee the old segment will still exist. It may have been deleted or renamed with a .partial suffix So the need for tli1 does not always imply that the wal was generated after promotion. It could be bytes from before the switchpoint, but they now be available through the new tli segment file. 6) XLogReadDetermineTimeline() needs the current system-wide TLI. The current code in does roughly: am_cascading_walsender =3D RecoveryInProgress(); if (am_cascading_walsender) GetXLogReplayRecPtr(&currTLI); else currTLI =3D GetWALInsertionTimeLine(); During promotion, we set the wal insertion timeline before marking recovery as fully done. So there is a window where: - InsertTimeLineID is already the new timeline, e.g. tli 2. - RecoveryInProgress() can still return true in another backend. - A logical walsender that started before promotion can wake up and choose its read timeline using the old recovery-state test. In the race, replay tli 1 is chosen even though the promotion has created the new tli 2 wal segment copy. Then WalSndSegmentOpen() tries to open 000000010000000000000006, which may already be gone or renamed, while the correct file is 000000020000000000000006. 7) Bertrand's proposal to fix this Recognizing the intermediate promotion state -- if recovery still shows progress but the insertion tli is already advanced, use that insertion tli for the timeline to be chosen. > However, I think logical_read_xlog_page() is called after the new WAL rec= ords > are generated, i.e., am_cascading_walsender has already been false at tha= t time. > So not sure where is the race? This function is not only called after new post-promotion records are generated. When START_REPLICATION begins, CreateDecodingContext() / XLogBeginRead() starts from the slot's restart_lsn, which can be well before promotion. -- Regards, Xuneng Zhou HighGo Software Co., Ltd.