MIME-Version: 1.0
References: <CACKLY6hwMnSjjipZ57TTz_FC9aWBkg7OpNLJ1tu+GvfEFc4hJA@mail.gmail.com>
 <8a534c5f-e400-4bb5-b39e-2017d259ff06@aklaver.com> <CACKLY6hYp+9W9xijXFh_UEpDuoo7bxg-A3deBKcEzEub3K+Kfg@mail.gmail.com>
 <10bcc03d-fa3e-40b6-bdd2-cac0acd046f4@aklaver.com>
In-Reply-To: <10bcc03d-fa3e-40b6-bdd2-cac0acd046f4@aklaver.com>
From: Daniel McKenzie <daniel.mckenzie@curvedental.com>
Date: Thu, 9 May 2024 08:32:36 +0100
Message-ID: <CACKLY6hNLT8qQWjmMChFnALr3UFEgzCxbTFbckfWd6GjHyn7Sw@mail.gmail.com>
Subject: Re: Unexpected data when subscribing to logical replication slot
To: Adrian Klaver <adrian.klaver@aklaver.com>, tomas.vondra@enterprisedb.com
Cc: pgsql-general@postgresql.org
Content-Type: multipart/alternative; boundary="000000000000dfffd40618006cb2"
Archived-At: <https://www.postgresql.org/message-id/CACKLY6hNLT8qQWjmMChFnALr3UFEgzCxbTFbckfWd6GjHyn7Sw%40mail.gmail.com>
Precedence: bulk

--000000000000dfffd40618006cb2
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

>
> Asynchronous commit introduces the risk of data loss. There is a short
> time window between the report of transaction completion to the client
> and the time that the transaction is truly committed.


The documentation describes synchronous_commit in terms of what the client
sees when a transaction commits. In this case my psql terminal is the
client, so I would expect a faster commit (from its perspective) followed
by a period of risk, because work normally done as part of the commit is
now done in the background. It is not clear how that affects a
replication slot subscriber.

What we're struggling to understand is: why are we seeing any updates in
the replication slot before they have been "truly committed"?

There appears to be a window between a change being written and that change
being visible to queries, and our subscriber is picking up changes inside
that window, but I can't find any documentation that describes it.

We've had this running in production for years without a hiccup, so we are
surprised to learn that we have had this massive race condition all along
and that, until now, the hardware has simply been fast enough to finish
processing the transaction before the .NET application could react to
replication slot changes.
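
To make the ordering I am describing concrete, here is a toy Python model.
It is only a sketch of my mental picture, not of PostgreSQL internals:
ToyServer, stream_change, make_visible and run are all made-up names, and
the assumption that a change can reach the subscriber before it is visible
to queries is exactly the thing I am asking about.

```python
# Toy model only: none of these names are PostgreSQL APIs.
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Keeps 'streamed to the subscriber' separate from 'visible to queries'."""
    visible: dict = field(default_factory=dict)  # what a new query would see
    pending: dict = field(default_factory=dict)  # streamed, not yet visible

    def stream_change(self, key, value):
        # Step 1 (assumed): the subscriber is told about the change.
        self.pending[key] = value
        return {"key": key, "new_value": value}

    def make_visible(self, key):
        # Step 2 (assumed): the change becomes visible to new queries.
        self.visible[key] = self.pending.pop(key)

    def query(self, key):
        return self.visible.get(key)


def run(subscriber_is_fast):
    server = ToyServer(visible={"last_name": "Jones"})
    event = server.stream_change("last_name", "Smith")
    if subscriber_is_fast:
        # The subscriber queries inside the window between step 1 and step 2.
        seen = server.query(event["key"])
        server.make_visible(event["key"])
    else:
        # The server wins the race; the subscriber queries afterwards.
        server.make_visible(event["key"])
        seen = server.query(event["key"])
    return seen
```

In this model run(True) returns the old value "Jones" and run(False)
returns "Smith", which matches the occasional stale read we see from the
enrichment query.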

Daniel McKenzie
Software Developer

Office: +1 403.910.5927 x 251
Mobile: +44 7712 159045
Website: www.curvedental.com


On Wed, May 8, 2024 at 5:28 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> On 5/8/24 08:24, Daniel McKenzie wrote:
> > It's running both (in docker containers) and also quite a few more
> > docker containers running various .NET applications.
>
> I think what you found is that the r7a.medium instance is not capable
> enough to do all that it is asked without introducing lag under load.
> Answering the questions posed by Tomas Vondra would help get to the
> actual cause of the lag.
>
> In the meantime my suspicion is this part:
>
> "For example, when I use a psql terminal to update a user's last name
> from "Jones" to "Smith" then I would expect the enrichment query to find
> "Smith" but it will sometimes still find "Jones". It finds the old data
> perhaps 1 in 50 times."
>
> If this is being run against the Postgres server my guess is that
> synchronous_commit=on is causing the commit on the server to wait for
> the WAL records to be flushed to disk and this is not happening in a
> timely manner in the '... 1 in 50 times' you mention. In that case you
> see the old values not the new committed values. This seems to be
> confirmed when you set synchronous_commit=off and don't see old values.
> For completeness per:
>
> https://www.postgresql.org/docs/current/wal-async-commit.html
>
> "However, for short transactions this delay is a major component of the
> total transaction time. Selecting asynchronous commit mode means that
> the server returns success as soon as the transaction is logically
> completed, before the WAL records it generated have actually made their
> way to disk. This can provide a significant boost in throughput for
> small transactions.
>
> Asynchronous commit introduces the risk of data loss. There is a short
> time window between the report of transaction completion to the client
> and the time that the transaction is truly committed (that is, it is
> guaranteed not to be lost if the server crashes).  ...
> "
>
> >
> > Daniel McKenzie
> > Software Developer
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
>
>

--000000000000dfffd40618006cb2--