Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tqbIm-007ZGD-Ru for pgsql-hackers@arkaria.postgresql.org; Fri, 07 Mar 2025 17:16:01 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tqbIl-00CVLs-M4 for pgsql-hackers@arkaria.postgresql.org; Fri, 07 Mar 2025 17:15:59 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tqbIQ-00CRqK-D7 for pgsql-hackers@lists.postgresql.org; Fri, 07 Mar 2025 17:15:38 +0000 Received: from mail-ej1-x62b.google.com ([2a00:1450:4864:20::62b]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1tqbIM-001XJz-33 for pgsql-hackers@postgresql.org; Fri, 07 Mar 2025 17:15:38 +0000 Received: by mail-ej1-x62b.google.com with SMTP id a640c23a62f3a-ac256f4b3ecso159234966b.0 for ; Fri, 07 Mar 2025 09:15:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1741367735; x=1741972535; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=VKLA1Jw2XKIt4mqDR4bGwbT0g9eCJy1Sq+bzX9P/J5I=; b=HQQo4gM2wuQcVp0xU7a3s/TePHopR0JIsvZNsb4/tMbzjAp/eJZZMdBAAFU5CXEwMT r4UrGfAOk5YL/LK0f8OY6sjH6N+KMY/8qi1Qt5RZ5kzPlXEwddIpvawcxTq9QE/lxD7d YsH5vkNc9dNS4lhPX5HusvQjkZLlNAI6JPQCzW/xxEm93Rj2F2VG2/iRkka/7BQaq5vO +oeFzCDmlGDHz0BE1s+tws4kaAKDQ0oM/+BP4+Fz2NfImekfvjSTeH0cfppFtubYvJIv Hls78ju1i9A4Qq1N3rfDo+Er9+tDW4/UK7+F6w0ggUUY215nJ7BziyMIGNx14r5sb78Q fpZw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1741367735; x=1741972535; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=VKLA1Jw2XKIt4mqDR4bGwbT0g9eCJy1Sq+bzX9P/J5I=; b=SgyEoUcv2Tq/ceyz9+had33zTqutYQ6fCzm6Z3LzaJsw6+wtFOEB7TBOuuc/74jDS3 1sCnTimeiZwYrwnlhIfQUZaAhe/Y0iyW5HWZwmuslDBwUh5k6772/L6K8lOwVLVSxOGG cJ50eJh0/rcTRrxJV0AXcar62S1N8oI1SmlVK34efS+ffTlHvTldxOi2CcUBJItMWUV9 zpBm1u1QO+vO5W2pu3q+yUEQEi7BwReEphjKspr6vXEK8yS0GMQf0fNFSLnBS6qxXEFX G1Tmed+wP+DnhA2zmwxHTDuMt8bjApKP8yhn531mSayID9b4E9IwuCD6N5/Ka1u4o7sm 3eTw== X-Gm-Message-State: AOJu0Yzvi71eU3An9bmuJHZkUHggYTCr2u/yGSD9wSldCRpV03ONJFmo /5pdKLK7lySpOLPTMj+QSiLXZlednOl/zSyCegqlt8ipzTRbMV/Vy5d2t8nTm/5igeRxY4G7Sp7 4FB8Wlq1hm70aG8/pd4DLDHdRB/c= X-Gm-Gg: ASbGncs6QUyxFRle5NKiks4zwmeZjR4XUmxSb+gFMXm1a7Ah9RVxLj60b2rn5zkewgj 4exihoIb70b6Ewv5mO/Fya+WoX2cG+sQVPhUQpBoLl2FpzCTDn02FaEUumTLNJN5asjNZ7/vEmi bqnBQuvRHm8ll8Vu510hB6GpBcZQ== X-Google-Smtp-Source: AGHT+IHj8TzUC01zs4KqRA/t5OpZZPu5aonh6qKMMveh5suAWTYMe1zaKJ9LjWwY1KKRrpLrg3IVnUPHKuzWtlJKej0= X-Received: by 2002:a17:907:61a5:b0:ac0:4364:407e with SMTP id a640c23a62f3a-ac26c9cf17amr41545266b.4.1741367734936; Fri, 07 Mar 2025 09:15:34 -0800 (PST) MIME-Version: 1.0 References: <2muwyx6a5vojkg7iegknhnkcch3lfxptsxk7icwuh7szkvvu2y@vc3ukkfvnu6i> In-Reply-To: From: Alexander Korotkov Date: Fri, 7 Mar 2025 19:15:23 +0200 X-Gm-Features: AQ5f1JqqTGKWLHoPWlPBMTUTEE44k4liLuVgjlabtrTaUtyp-Ln3DQoNyV6ZZEU Message-ID: Subject: Re: pg_atomic_compare_exchange_*() and memory barriers To: Andres Freund Cc: pgsql-hackers Content-Type: multipart/alternative; boundary="00000000000043ac3d062fc3c55f" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000043ac3d062fc3c55f Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Fri, Mar 7, 2025 at 7:07=E2=80=AFPM Andres Freund w= rote: > On 2025-03-07 18:39:42 +0200, Alexander Korotkov wrote: > > On Fri, Mar 7, 2025 at 6:02=E2=80=AFPM Andres Freund wrote: > > > > > > On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote: > > > > While investigating a bug in the patch to get rid of WALBufMappingLock, I > > > > found that the surrounding pg_atomic_compare_exchange_u64() fixes the > > > > problem for me. > > > > > > That sounds more likely to be due to slowing down things enough to make a race > > > less likely to be hit. Or a compiler bug. > > > > > > > > > > That doesn't feel right because, according to the > > > > comments, both pg_atomic_compare_exchange_u32() and > > > > pg_atomic_compare_exchange_u64() should provide full memory barrier > > > > semantics. So, I decided to investigate this further. > > > > > > > > In my case, the implementation of pg_atomic_compare_exchange_u64() is based > > > > on __atomic_compare_exchange_n(). > > > > > > > > static inline bool > > > > pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, > > > > uint64 *expected, uint64 newval= ) > > > > { > > > > AssertPointerAlignment(expected, 8); > > > > return __atomic_compare_exchange_n(&ptr->value, expected, newval, false, > > > > __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); > > > > } > > > > > > > > According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all > > > > other __ATOMIC_SEQ_CST operations*. It only says about other > > > > __ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes. > > > > This sounds quite far from our comment, promising full barrier semantics, > > > > which, in my understanding, means the equivalent of > > > > both pg_read_barrier()/pg_write_barrier(), which should barrier *all other > > > > read/writes*. > > > > > > A memory barrier in one process/thread always needs be paired with a barrier > > > in another process/thread. It's not enough to have a memory barrier on one > > > side but not the other. > > > > Yes, there surely should be a memory barrier on another side. But > > does __ATOMIC_SEQ_CST works as a barrier for the regular read/write > > operations on the same side? > > Yes, if it's paired with another barrier on the other side. The regular > read/write operations are basically protected transitively, due to > > a) An acquire barrier preventing non-atomic reads/writes in the same thread > from appearing to have been moved to before the barrier > > b) A release barrier preventing non-atomic reads/writes in the same threa= d > from appearing to have been moved to after the barrier > > c) The other thread being guaranteed a) and b) for the other threads' > non-atomic reads/writes depending on the type of barrier > > d) The atomic value itself being guaranteed to be, well, atomic Yes, looks good to me. > > > > This function first checks if LSE instructions are present. If so, > > > > instruction LSE casal is used. If not (in my case), the loop of > > > > ldaxr/stlxr is used instead. The documentation says both ldaxr/stlxr > > > > provides one-way barriers. Read/writes after ldaxr will be observed after, > > > > and read/writes before stlxr will be observed before. So, a pair of these > > > > instructions provides a full memory barrier. However, if we don't observe > > > > the expected value, only ldaxr gets executed. So, an unsuccessful > > > > pg_atomic_compare_exchange_u64() attempt will leave us with a one-way > > > > barrier, and that caused a bug in my case. > > > > > > That has to be a compiler bug. We specify __ATOMIC_SEQ_CST for both > > > success_memorder *and* failure_memorder. > > > > > > What compiler & version is this? > > > > I've tried and got the same results with two compilers. > > gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 > > Ubuntu clang version 19.1.1 (1ubuntu1~24.04.2) > > Thinking more about it I wonder if the behaviour of not doing a release i= n > case the atomic fails *might* arguably actually be correct - if the > compare-exchange fails, there's nothing that the non-atomic values could be > ordered against. > > What is the access pattern and the observed problems with it that made yo= u > look at the disassembly? Check this code. l1: pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr); /* * Try to advance XLogCtl->InitializedUpTo. * * If the CAS operation failed, then some of previous pages are not * initialized yet, and this backend gives up. * * Since initializer of next page might give up on advancing of * InitializedUpTo, this backend have to attempt advancing until it * find page "in the past" or concurrent backend succeeded at * advancing. When we finish advancing XLogCtl->InitializedUpTo, w= e * notify all the waiters with XLogCtl->InitializedUpToCondVar. */ l2: while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, &NewPageBeginPtr, NewPageEndPtr)) { NewPageBeginPtr =3D NewPageEndPtr; NewPageEndPtr =3D NewPageBeginPtr + XLOG_BLCKSZ; nextidx =3D XLogRecPtrToBufIdx(NewPageBeginPtr); l3: if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) !=3D NewPageEndPtr) { /* * Page at nextidx wasn't initialized yet, so we cann't mov= e * InitializedUpto further. It will be moved by backend which * will initialize nextidx. */ ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar); break; } } Consider the following execution order with process 1 (p1) and process 2 (p2). 1. p1 executes l1 2. p2 executes l2 with success 3. p1 executes l2 with failure 4. p2 execute l3, but doesn't see the results of step 1, because 3 didn't provide enough of memory barrier ------ Regards, Alexander Korotkov Supabase --00000000000043ac3d062fc3c55f Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Fri, Mar 7, 2025 at 7:07=E2=80=AFPM Andres Freund <<= a href=3D"mailto:andres@anarazel.de">andres@anarazel.de> wrote:
&= gt; On 2025-03-07 18:39:42 +0200, Alexander Korotkov wrote:
> > On= Fri, Mar 7, 2025 at 6:02=E2=80=AFPM Andres Freund <andres@anarazel.de> wrote:
> > >
&g= t; > > On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote:
>= ; > > > While investigating a bug in the patch to get rid of WALBu= fMappingLock, I
> > > > found that the surrounding pg_atomic= _compare_exchange_u64() fixes the
> > > > problem for me.> > >
> > > That sounds more likely to be due to slow= ing down things enough to make a race
> > > less likely to be h= it. Or a compiler bug.
> > >
> > >
> > >= ; > That doesn't feel right because, according to the
> > &= gt; > comments, both pg_atomic_compare_exchange_u32() and
> > &= gt; > pg_atomic_compare_exchange_u64() should provide full memory barrie= r
> > > > semantics.=C2=A0 So, I decided to investigate this= further.
> > > >
> > > > In my case, the imp= lementation of pg_atomic_compare_exchange_u64() is based
> > > = > on __atomic_compare_exchange_n().
> > > >
> > = > > static inline bool
> > > > pg_atomic_compare_excha= nge_u64_impl(volatile pg_atomic_uint64 *ptr,
> > > > =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 uint64 *expected, uint64 newv= al)
> > > > {
> > > > =C2=A0 =C2=A0 AssertPoi= nterAlignment(expected, 8);
> > > > =C2=A0 =C2=A0 return __a= tomic_compare_exchange_n(&ptr->value, expected, newval, false,
&g= t; > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
> > > > }
= > > > >
> > > > According to the docs __ATOMIC_S= EQ_CST enforces total ordering with *all
> > > > other __ATO= MIC_SEQ_CST operations*.=C2=A0 It only says about other
> > > &= gt; __ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes= .
> > > > This sounds quite far from our comment, promising = full barrier semantics,
> > > > which, in my understanding, = means the equivalent of
> > > > both pg_read_barrier()/pg_wr= ite_barrier(), which should barrier *all other
> > > > read/= writes*.
> > >
> > > A memory barrier in one proces= s/thread always needs be paired with a barrier
> > > in another= process/thread. It's not enough to have a memory barrier on one
>= ; > > side but not the other.
> >
> > Yes, there su= rely should be a memory barrier on another side.=C2=A0 But
> > doe= s __ATOMIC_SEQ_CST works as a barrier for the regular read/write
> &g= t; operations on the same side?
>
> Yes, if it's paired wit= h another barrier on the other side. The regular
> read/write operati= ons are basically protected transitively, due to
>
> a) An acqu= ire barrier preventing non-atomic reads/writes in the same thread
> = =C2=A0 =C2=A0from appearing to have been moved to before the barrier
>= ;
> b) A release barrier preventing non-atomic reads/writes in the sa= me thread
> =C2=A0 =C2=A0from appearing to have been moved to after t= he barrier
>
> c) The other thread being guaranteed a) and b) f= or the other threads'
> =C2=A0 =C2=A0non-atomic reads/writes depe= nding on the type of barrier
>
> d) The atomic value itself bei= ng guaranteed to be, well, atomic

Yes, looks good to me.

>= > > > This function first checks if LSE instructions are present.= =C2=A0 If so,
> > > > instruction LSE casal is used.=C2=A0 I= f not (in my case), the loop of
> > > > ldaxr/stlxr is used = instead.=C2=A0 The documentation says both ldaxr/stlxr
> > > &g= t; provides one-way barriers.=C2=A0 Read/writes after ldaxr will be observe= d after,
> > > > and read/writes before stlxr will be observ= ed before.=C2=A0 So, a pair of these
> > > > instructions pr= ovides a full memory barrier.=C2=A0 However, if we don't observe
>= ; > > > the expected value, only ldaxr gets executed.=C2=A0 So, an= unsuccessful
> > > > pg_atomic_compare_exchange_u64() attem= pt will leave us with a one-way
> > > > barrier, and that ca= used a bug in my case.
> > >
> > > That has to be a= compiler bug.=C2=A0 We specify __ATOMIC_SEQ_CST for both
> > >= success_memorder *and* failure_memorder.
> > >
> > &g= t; What compiler & version is this?
> >
> > I've = tried and got the same results with two compilers.
> > gcc (Ubuntu= 13.3.0-6ubuntu2~24.04) 13.3.0
> > Ubuntu clang version 19.1.1 (1u= buntu1~24.04.2)
>
> Thinking more about it I wonder if the beha= viour of not doing a release in
> case the atomic fails *might* argua= bly actually be correct - if the
> compare-exchange fails, there'= s nothing that the non-atomic values could be
> ordered against.
&= gt;
> What is the access pattern and the observed problems with it th= at made you
> look at the disassembly?

Check this code.
l1:=C2=A0 =C2=A0 =C2=A0pg_atomic_write_u64(&X= LogCtl->xlblocks[nextidx], NewPageEndPtr);

=C2=A0 =C2=A0 =C2=A0 = =C2=A0 /*
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* Try to advance XLogCtl->= ;InitializedUpTo.
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0*
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0* If the CAS operation failed, then some of previous pa= ges are not
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* initialized yet, and thi= s backend gives up.
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0*
=C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0* Since initializer of next page might give up on adva= ncing of
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* InitializedUpTo, this backe= nd have to attempt advancing until it
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= * find page "in the past" or concurrent backend succeeded at
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* advancing.=C2=A0 When we finish advanci= ng XLogCtl->InitializedUpTo, we
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* n= otify all the waiters with XLogCtl->InitializedUpToCondVar.
=C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0*/
l2:=C2=A0 =C2=A0 =C2=A0while (pg_atomic_co= mpare_exchange_u64(&XLogCtl->InitializedUpTo, &NewPageBeginPtr, = NewPageEndPtr))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 {
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 NewPageBeginPtr =3D NewPageEndPtr;
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 NewPageEndPtr =3D NewPageBeginPtr + XLOG_BLCKSZ= ;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 nextidx =3D XLogRecPtrToBufI= dx(NewPageBeginPtr);

l3:=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (pg_ato= mic_read_u64(&XLogCtl->xlblocks[nextidx]) !=3D NewPageEndPtr)
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 /*
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0* Page at nextidx wasn't initialized yet, so we can= n't move
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0* InitializedUpto further. It will be moved by backend which
=C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* will initialize ne= xtidx.
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0*/<= br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ConditionVariabl= eBroadcast(&XLogCtl->InitializedUpToCondVar);
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 break;
=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 }
=C2=A0 =C2=A0 =C2=A0 =C2=A0 }

Consider= the following execution order with process 1 (p1) and process 2 (p2).
=
1. p1 executes l1
2. p2 executes l2 with success
3. p1 executes l2 with failure
4. p2 execute l3, bu= t doesn't see the results of step 1, because 3 didn't provide enoug= h of memory barrier

------
Regards,
Alexander Korotkov
Sup= abase

--00000000000043ac3d062fc3c55f--