From: Andy Fan
To: Michael Paquier
Cc: PostgreSQL Hackers
Subject: Re: raise ERROR between EndPrepare and PostPrepare_Locks causes ROLLBACK 2pc PAINC
In-Reply-To: (Michael Paquier's message of "Wed, 25 Mar 2026 10:50:53 +0900")
References: <87341p7dc4.fsf@163.com> <87h5q468us.fsf@163.com>
Date: Wed, 25 Mar 2026 11:27:10 +0800
Message-ID: <874im44mi9.fsf@163.com>
MIME-Version: 1.0
Content-Type:
 text/plain

Michael Paquier writes:

Hi,

Thanks for showing interest in this topic!

> On Wed, Mar 25, 2026 at 08:39:07AM +0800, Andy Fan wrote:
>> I found a similar but not exactly same case from 2014 [1] which
>> might be helpful to recall a broader understanding of this area.
>>
>> [1] https://www.postgresql.org/message-id/534AF601.1030007%40vmware.com
>
> Incorrect shared state when an ERROR happens at an arbitrary location
> is usually bad, yes.
>
> For this one, your suggestion of delaying the end of the critical
> section started at StartPrepare() and ending in EndPrepare() is not an
> acceptable solution as far as I can see, unfortunately: it would mean
> doing a SyncRepWaitForLSN() while in a critical section, and I doubt
> we'd want to do that.

I do have some more general questions:

(Q1) What is a critical section designed for?
(Q2) What is the downside of putting more code into one, apart from
     the ERROR->PANIC promotion?
(Q3) Can SyncRepWaitForLSN() raise an ERROR? If it cannot today but
     might in the future, then Q2 applies.

> Anyway, I doubt that this one is worth caring for. The current locking
> 2PC scheme means, as far as I remember, that it is not really possible
> to interact with an external command in a specific session between
> the EndPrepare() and the PostPrepare_Locks()
> calls.

This is where Q3 comes in. The deeper answer probably lies in Q2 or Q1.
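
As background for Q1 and Q2, here is a minimal model of the ERROR->PANIC
promotion. It is only a sketch, not the PostgreSQL source: the function
names and the numeric level values are made up for illustration. The one
thing taken from PostgreSQL is the idea that a per-backend nesting counter
(CritSectionCount there) causes any report at ERROR or above to be raised
as PANIC while the counter is non-zero.

```c
#include <assert.h>

/* Severity levels in ascending order; the numeric values are illustrative. */
enum { WARNING = 1, ERROR = 2, FATAL = 3, PANIC = 4 };

/* Hypothetical stand-in for the per-backend nesting counter. */
static int crit_section_count = 0;

/* Entering a critical section only bumps the counter; sections may nest. */
static void start_crit_section(void)
{
    crit_section_count++;
}

/* Leaving a critical section; leaving one that was never entered is a bug. */
static void end_crit_section(void)
{
    assert(crit_section_count > 0);
    crit_section_count--;
}

/*
 * Return the level a report is actually raised at.  Anything at ERROR or
 * above becomes PANIC inside a critical section; lower levels such as
 * WARNING pass through unchanged.
 */
static int effective_elevel(int elevel)
{
    if (elevel >= ERROR && crit_section_count > 0)
        return PANIC;
    return elevel;
}
```

In this model, widening a critical section does not change what happens to
a WARNING raised inside it, but every ERROR path in the added code becomes
a PANIC.
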
> To put it in other words, let's imagine that we use a breakpoint
> between these two calls (or a wait injection point if you automate
> that). Is it possible for a second backend to mess with the state of
> the first backend waiting until its locks are transferred to the dummy
> PGPROC entry? That's what the 2014 thread is about: there was a race
> condition reachable between two sessions.

That is true, so the issue in the 2014 thread is more critical than the
current one, and it has already been fixed.

> If the answer to this question is yes, I'd agree that this is
> something that deserves a closer lookup.

Generally yes. But I can't stop thinking about Q3 -> Q2 -> Q1 when I try
to accept this answer.

> And before you ask: attempting to interact with a 2PC
> state from a second session with a first session waiting between these
> two points would not work: the 2PC entry is locked, cleaned up after
> EndPrepare() and PostPrepare_Locks() at PostPrepare_Twophase().
> Trying to request an access to this entry fails, as the first backend
> is marked as locking it. A second backend attempting to lock it would
> fail, complaining that the 2PC entry with a GXID is "busy".

I understand what you are saying now, but what difference does it make
for the case above?

> SyncRepWaitForLSN() would be a problematic pattern between the
> EndPrepare() and the PostPrepare_Locks(), but we never ERROR there on
> purpose: even if we cancel while waiting for a transaction commit we'd
> just get a WARNING, meaning that we'd be able to transfer our locks
> anyway.

Again, Q2 -> Q3.

> Or perhaps you have a realistic scenario where it is possible to mess
> up with the shared state, outside an elog(ERROR) forced between these
> two points?

Not really. I just injected exceptions at some predefined places. I don't
even know why people defined these places in the first place. As for me,
I would like to know MORE of the design points behind the CRITICAL
SECTION, beyond what I said before.

-- 
Best Regards
Andy Fan