Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tqt2f-00BbQH-91 for pgsql-hackers@arkaria.postgresql.org; Sat, 08 Mar 2025 12:12:33 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tqt2c-00GeOT-9i for pgsql-hackers@arkaria.postgresql.org; Sat, 08 Mar 2025 12:12:30 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tqt2b-00GeOL-SI for pgsql-hackers@lists.postgresql.org; Sat, 08 Mar 2025 12:12:29 +0000 Received: from mail-ed1-x52d.google.com ([2a00:1450:4864:20::52d]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1tqt2Y-001g6S-0j for pgsql-hackers@postgresql.org; Sat, 08 Mar 2025 12:12:29 +0000 Received: by mail-ed1-x52d.google.com with SMTP id 4fb4d7f45d1cf-5e058ca6806so5308561a12.3 for ; Sat, 08 Mar 2025 04:12:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1741435945; x=1742040745; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=3Kho7rk/4RHSFln91rFLJ//an1zPNSmMuJFoUwG9rpQ=; b=b8jQ9zZWb4EUb5XAGgJbFNsf5tHo1n5zB60QBBSw1Om/iFi8VTL1TUDyJP5lFCKF7a Uubgtm2qltSewItN0Dw8pVcXSCDo8nIppIGAqK2JaeUDJ1C701+TMI65/ZLfG4p0taBC jeirJW0U4Yf5h8TmgpTfA5SPs65mh4BjW+41MCgINpjAVtcWth4rOG0Eq0vDiP2xqPUp pQqV6yssdxFjPlsnHhaaA9LRgv+R64v3qa7gq+SQDQPoMm1SBunrb69vD70G0SEPlqws GFNHIMB4MQ/NhUIQnP4CthqJ4sg3Bx8mb7ElO4lC7I3kPyvaZiTLlApJMM82bkIIvBx7 dlRw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1741435945; x=1742040745; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=3Kho7rk/4RHSFln91rFLJ//an1zPNSmMuJFoUwG9rpQ=; b=EMyejZbsBWmJdxfPD4CKAHHBnwSe0mEjf7jXnW9B4CywGChIBW8m8Zx4KFVxp0nboB /Gv1KLm6BDCYPsBgc238aXaFWr9A8g+4qj0kbnKY0shfVesjhY5v2uuo1D7pq8tOjNTY xjhYxGyV66gCdhTg2XY3OBmHTJd1wVSDuFGqYqLLGSbnHqTLMKnu7qXnSwhpBIMGI728 ySBrVS0wlJ42uMO7qCylkwvD5bN4iFRqga8loJR4tJYp4D34l6RPZfbGcrAt53bEJys6 rnSwjjBt8GQhsNp4jXeOZuhrfB9cb5lRNYm3keBx3V2fiZ0gHuaLKq4Sfst4vptWHqTs 3Cmw== X-Gm-Message-State: AOJu0YyfQ33bow60d+io1hpKEQYZ26g0y8t9qiVsffe86BR3G5pKJDKj LoAU6KsbCSkgoubII+igBLUpenoJ+cRRVwA+EZoCKCc6YW/mXP5Llo7sc/yoOUh6mQMOWzTuuXv r1iJ7z2Yva9cvTezUKXPCYxvsdkYDhPdM X-Gm-Gg: ASbGnctSxWst1wO34R/EZ0Yc4fzdDYGGGZwOFW0FcZ8y922fTfeIwEjtW1ZSS+57J+8 GFGTXWGUtD5apfjhyRdvk7FxDLjZF94cEYrZ/b5HaWY+NIpv74wtRr5dplnKqq4M+lSbVkn6W7B M0aBDgwoh9Io18UznMYo/5ABwN1lt1B9eZGdIo X-Google-Smtp-Source: AGHT+IGQW0S5vJr0HSy4OFJM+5a7Ol3E9QrqUIJrMOeGhg2iP6B5GZGDloSCQTe4C2UvV6Obqy6eSe35/65VBr0kQac= X-Received: by 2002:a17:907:3f1f:b0:abf:749f:f719 with SMTP id a640c23a62f3a-ac252737f80mr844948366b.7.1741435944586; Sat, 08 Mar 2025 04:12:24 -0800 (PST) MIME-Version: 1.0 References: <2muwyx6a5vojkg7iegknhnkcch3lfxptsxk7icwuh7szkvvu2y@vc3ukkfvnu6i> In-Reply-To: From: Alexander Korotkov Date: Sat, 8 Mar 2025 14:12:13 +0200 X-Gm-Features: AQ5f1JpzQgiF_V2AcvPDTcWN3d-T3eMjp9au30_ipMPzRzfj-Ohy1mMPCV6L8IU Message-ID: Subject: Re: pg_atomic_compare_exchange_*() and memory barriers To: Andres Freund Cc: pgsql-hackers Content-Type: multipart/alternative; boundary="000000000000e059c6062fd3a625" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000e059c6062fd3a625 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi, Andres! On Fri, Mar 7, 2025 at 7:54=E2=80=AFPM Alexander Korotkov wrote: > On Fri, Mar 7, 2025 at 7:46=E2=80=AFPM Andres Freund = wrote: > > On 2025-03-07 19:44:20 +0200, Alexander Korotkov wrote: > > > On Fri, Mar 7, 2025 at 7:38=E2=80=AFPM Andres Freund wrote: > > > > On 2025-03-07 19:15:23 +0200, Alexander Korotkov wrote: > > > > > On Fri, Mar 7, 2025 at 7:07=E2=80=AFPM Andres Freund wrote: > > > > > > What is the access pattern and the observed problems with it that made you > > > > > > look at the disassembly? > > > > > > > > > > Check this code. > > > > > > > > > > l1: pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr); > > > > > > > > > /* > > > > > * Try to advance XLogCtl->InitializedUpTo. > > > > > * > > > > > * If the CAS operation failed, then some of previous pages are not > > > > > * initialized yet, and this backend gives up. > > > > > * > > > > > * Since initializer of next page might give up on advancing of > > > > > * InitializedUpTo, this backend have to attempt advancing until it > > > > > * find page "in the past" or concurrent backend succeeded at > > > > > * advancing. When we finish advancing XLogCtl->InitializedUpTo, we > > > > > * notify all the waiters with XLogCtl->InitializedUpToCondVar. > > > > > */ > > > > > l2: while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, > > > > > &NewPageBeginPtr, NewPageEndPtr)) > > > > > { > > > > > NewPageBeginPtr =3D NewPageEndPtr; > > > > > NewPageEndPtr =3D NewPageBeginPtr + XLOG_BLCKSZ; > > > > > nextidx =3D XLogRecPtrToBufIdx(NewPageBeginPtr); > > > > > > > > > > l3: if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != =3D > > > > > NewPageEndPtr) > > > > > { > > > > > /* > > > > > * Page at nextidx wasn't initialized yet, so we cann't move > > > > > * InitializedUpto further. It will be moved by backend > > > > > which > > > > > * will initialize nextidx. > > > > > */ > > > > > > > > > > ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar); > > > > > break; > > > > > } > > > > > } > > > > > > > > > > Consider the following execution order with process 1 (p1) and process 2 > > > > > (p2). > > > > > > > > On 2025-03-07 19:24:39 +0200, Alexander Korotkov wrote: > > > > > Sorry, I messed this up. > > > > > The correct sequence is following. > > > > > > > > > > 1. p1 executes l1 > > > > > 2. p1 executes l2 with failure > > > > > 3. p2 executes l2 with success > > > > > 4. p2 execute l3, but doesn't see the results of step 1, because = 3 > > > > > didn't provide enough of memory barrier > > > > > > > > Did you mean because 2) didn't provide enough of a memory barrier? Because 3) > > > > does, right? > > > > > > Yes, exactly. > > > > > > > You could get in exactly same the situation if the p1 is scheduled out by the > > > > OS after step 1, no? > > > > > > No. In that case, p1 will execute l2 with success. p1 executes l2 > > > with failure only because it goes before p2 executes l2. > > > > In my scenario p1 will not execute l2 because p2 gets scheduled before it can > > do so. So p1 cant yet execute l2 before p2 executes l2. > > In my scenario p1.l2 will be successful if executed after p2.l2 and > failed if executed before p2.l2. Imagine initial value is 0. p1 > tries to change 1=3D>2, while p2 tries to change 0 =3D> 1. My explanation was cumbersome form the beginning. Let me come with more clear example. Let us have a shared memory variable v, and shared memory flag f. The initial state is as following. v =3D 0 f =3D false Two processes p1 and p2 runs the following programs in pseudocode. Atomic compare-exchange operation is represented as CAS(variable, oldval, newval). In this example we don't need to update oldval on failure or even know whether operation is successful. p1: CAS(v, 0, 1) if f: CAS(v, 1, 2) p2: f =3D true CAS(v, 1, 2) I think if CAS() implements full barrier semantics, the invariant is that v eventually get the value 2. If CAS(v, 1, 2) by p2 goes before CAS(v, 0, 1) by p1, p1 will reliably see f =3D=3D true and apply CAS(v, 1, 2). If CAS() don't implement full barrier on failure this invariant gets broken (that is CAS() by p2 fails, but p1 see f =3D=3D false). I'm not an expert in formal specifications of memory models. But I'm quite surprised we're discussing whether memory barrier on compare-exchange failure might matter. For me at least the fact that __atomic_compare_exchange_n() have failure_memorder argument is a quite an evidence of that. ------ Regards, Alexander Korotkov Supabase --000000000000e059c6062fd3a625 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi, Andres!

On Fri, Mar 7, 2025 at 7:54=E2=80=AFPM = Alexander Korotkov <aekorotkov@g= mail.com> wrote:
> On Fri, Mar 7, 2025 at 7:46=E2=80=AFPM Andr= es Freund <andres@anarazel.de&= gt; wrote:
> > On 2025-03-07 19:44:20 +0200, Alexander Korotkov wr= ote:
> > > On Fri, Mar 7, 2025 at 7:38=E2=80=AFPM Andres Freund= <andres@anarazel.de> wrote= :
> > > > On 2025-03-07 19:15:23 +0200, Alexander Korotkov w= rote:
> > > > > On Fri, Mar 7, 2025 at 7:07=E2=80=AFPM An= dres Freund <andres@anarazel.de> wrote:
> > > > > > What is the access pattern an= d the observed problems with it that made you
> > > > > &= gt; look at the disassembly?
> > > > >
> > > = > > Check this code.
> > > > >
> > > &g= t; > l1: =C2=A0 =C2=A0 pg_atomic_write_u64(&XLogCtl->xlblocks[nex= tidx], NewPageEndPtr);
> > > >
> > > > > = =C2=A0 =C2=A0 =C2=A0 =C2=A0 /*
> > > > > =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0* Try to advance XLogCtl->InitializedUpTo.
> &= gt; > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0*
> > > &g= t; > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* If the CAS operation failed, th= en some of previous pages are not
> > > > > =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0* initialized yet, and this backend gives up.
> = > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0*
> > > &= gt; > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* Since initializer of next page= might give up on advancing of
> > > > > =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0* InitializedUpTo, this backend have to attempt advanci= ng until it
> > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0*= find page "in the past" or concurrent backend succeeded at
&g= t; > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* advancing.=C2=A0= When we finish advancing XLogCtl->InitializedUpTo, we
> > >= > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* notify all the waiters with = XLogCtl->InitializedUpToCondVar.
> > > > > =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0*/
> > > > > l2: =C2=A0 =C2=A0 wh= ile (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo,
&g= t; > > > > &NewPageBeginPtr, NewPageEndPtr))
> > &= gt; > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 {
> > > > > =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 NewPageBeginPtr =3D NewPageEndPtr;> > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 NewPag= eEndPtr =3D NewPageBeginPtr + XLOG_BLCKSZ;
> > > > > =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 nextidx =3D XLogRecPtrToBufIdx(NewPa= geBeginPtr);
> > > > >
> > > > > l3: = =C2=A0 =C2=A0 =C2=A0 =C2=A0 if (pg_atomic_read_u64(&XLogCtl->xlblock= s[nextidx]) !=3D
> > > > > NewPageEndPtr)
> > &g= t; > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 {
> > > = > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 /*
>= ; > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0* Page at nextidx wasn't initialized yet, so we cann't mo= ve
> > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0* InitializedUpto further. It will be moved by backend<= br>> > > > > which
> > > > > =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* will initialize nextidx.=
> > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0*/
> > > > >
> > > > >= ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
&g= t; > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 break;
> > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 }
> > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 }
> = > > > >
> > > > > Consider the following exec= ution order with process 1 (p1) and process 2
> > > > > (= p2).
> > > >
> > > > On 2025-03-07 19:24:39 += 0200, Alexander Korotkov wrote:
> > > > > Sorry, I messed= this up.
> > > > > The correct sequence is following.> > > > >
> > > > > 1. p1 executes l1
= > > > > > 2. p1 executes l2 with failure
> > > &= gt; > 3. p2 executes l2 with success
> > > > > 4. p2 e= xecute l3, but doesn't see the results of step 1, because 3
> >= ; > > > didn't provide enough of memory barrier
> > &= gt; >
> > > > Did you mean because 2) didn't provide = enough of a memory barrier? Because 3)
> > > > does, right?<= br>> > >
> > > Yes, exactly.
> > >
>= > > > You could get in exactly same the situation if the p1 is sc= heduled out by the
> > > > OS after step 1, no?
> >= >
> > > No. In that case, p1 will execute l2 with success. = =C2=A0p1 executes l2
> > > with failure only because it goes be= fore p2 executes l2.
> >
> > In my scenario p1 will not e= xecute l2 because p2 gets scheduled before it can
> > do so. So p1= cant yet execute l2 before p2 executes l2.
>
> In my scenario = p1.l2 will be successful if executed after p2.l2 and
> failed if exec= uted before p2.l2.=C2=A0 Imagine initial value is 0. =C2=A0p1
> tries= to change 1=3D>2, while p2 tries to change 0 =3D> 1.

My expla= nation was cumbersome form the beginning.=C2=A0 Let me come with more clear= example.

Let us have a shared memory variable v, and shared memory = flag f.=C2=A0 The initial state is as following.
v =3D 0
f =3D false<= br>
Two processes p1 and p2 runs the following programs in pseudocode.= =C2=A0 Atomic compare-exchange operation is represented as CAS(variable, ol= dval, newval).=C2=A0 In this example we don't need to update oldval on = failure or even know whether operation is successful.

p1:
CAS(v, 0, 1)
if f:
=C2=A0 =C2=A0CAS(v, 1, 2)
p2:
f =3D true
CAS(v, 1, 2)

I think if CAS() implement= s full barrier semantics, the invariant is that v eventually get the value = 2.=C2=A0 If CAS(v, 1, 2) by p2 goes before CAS(v, 0, 1) by p1, p1 will reli= ably see f =3D=3D true and apply CAS(v, 1, 2).

If CAS() = don't implement full barrier on failure this invariant gets broken (tha= t is CAS() by p2 fails, but p1 see f =3D=3D false).

I= 9;m not an expert in formal specifications of memory models.=C2=A0 But I= 9;m quite surprised we're discussing whether memory barrier on compare-= exchange failure might matter.=C2=A0 For me at least the fact that=C2=A0__a= tomic_compare_exchange_n() have=C2=A0failure_memorder argument is a quite a= n evidence of that.

------
Regards,
Alexander KorotkovSupabase
--000000000000e059c6062fd3a625--