MIME-Version: 1.0
References: <CAD=mzVXR3GjM0vcthMBwEdbOKqSKcv8oojSS9coczWRi9BRYTA@mail.gmail.com>
 <abd3bc064d16bc93a2d8661a692903da97d2c154.camel@cybertec.at>
In-Reply-To: <abd3bc064d16bc93a2d8661a692903da97d2c154.camel@cybertec.at>
From: sud <suds1434@gmail.com>
Date: Thu, 23 May 2024 13:41:31 +0530
Message-ID: <CAD=mzVVvK8xk-9m8h3Xu27cGN7BW329HKYdO+0EMXfWvSD3AGA@mail.gmail.com>
Subject: Re: Long running query causing XID limit breach
To: Laurenz Albe <laurenz.albe@cybertec.at>
Cc: pgsql-general <pgsql-general@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000e9f8aa06191a99dc"
Archived-At: <https://www.postgresql.org/message-id/CAD%3DmzVVvK8xk-9m8h3Xu27cGN7BW329HKYdO%2B0EMXfWvSD3AGA%40mail.gmail.com>
Precedence: bulk

--000000000000e9f8aa06191a99dc
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Thu, May 23, 2024 at 1:22 PM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:

> On Thu, 2024-05-23 at 02:46 +0530, sud wrote:
> > It's RDS postgres version 15.4. We suddenly saw the
> > "MaximumUsedTransactionIDs" reach to ~1.5billion and got alerted by
> > team members who mentioned the database is going to be in
> > shutdown/hung if this value reaches to ~2billion and won't be able
> > to serve any incoming transactions. It was a panic situation.
> >
> > I have heard of it before , because of the way postgres works and
> > the XID being a datatype of length 32 bit integer can only represent
> > (2^32)/2=~2 billion transactions. However, as RDS performs the auto
> > vacuum , we thought that we need not worry about this issue. But it
> > seems we were wrong. And we found one adhoc "SELECT '' query was
> > running on the reader instance since the last couple of days and
> > when that was killed, the max xid (MaximumUsedTransactionIDs)
> > dropped to 50million immediately.
>
> This has nothing to do with autovacuum running.
> PostgreSQL won't freeze any rows above the xmin horizon (see the
> "backend_xmin" column in "pg_stat_activity").
>
> > So I have few questions,
> >
> > 1)This system is going to be a 24/7 up and running system which will
> >   process ~500million business transactions/day in future i.e.
> >   ~4-5billion rows/day inserted across multiple tables each day. And
> >   as I understand each row will have XID allocated. So in that case ,
> >   does it mean that, we will need (5billion/24)=~200million XID/hour
> >   and thus , if any such legitimate application "SELECT" query keeps
> >   running for ~10 hours (and thus keep the historical XID alive) ,
> >   then it can saturate the "MaximumUsedTransactionIDs" and make the
> >   database standstill in 2billion/200million=~10hrs. Is this
> >   understanding correct? Seems we are prone to hit this limit sooner
> >   going forward.
>
> Yes, that is correct.  You cannot run such long-running queries with a
> transaction rate like that.
>
>
By "transaction", do you mean one commit? For example, if a batch inserts
and commits ~1000 rows together, will all 1000 rows be stamped with a
single XID rather than 1000 different XIDs? If so, we should move to batch
processing rather than row-by-row processing. Is that understanding
correct?
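
To put rough numbers on the question, here is a small Python sketch. It
assumes the answer is yes, i.e. that each committed batch consumes exactly
one XID (that is the very point I would like confirmed), and it uses the
5 billion rows/day estimate from my earlier mail. The function names are
only illustrative.

```python
# Half of the 32-bit XID space, i.e. the ~2 billion limit discussed above.
XID_LIMIT = 2**31


def xids_per_hour(rows_per_day: int, rows_per_commit: int) -> float:
    """XIDs consumed per hour, ASSUMING one XID per committed batch."""
    commits_per_day = rows_per_day / rows_per_commit
    return commits_per_day / 24


def hours_to_limit(rows_per_day: int, rows_per_commit: int) -> float:
    """Hours a snapshot could be held before XID_LIMIT XIDs are consumed."""
    return XID_LIMIT / xids_per_hour(rows_per_day, rows_per_commit)


if __name__ == "__main__":
    rows_per_day = 5_000_000_000  # estimate from the earlier mail
    for batch in (1, 1000):       # row-by-row vs. 1000 rows per commit
        rate = xids_per_hour(rows_per_day, batch)
        hours = hours_to_limit(rows_per_day, batch)
        print(f"{batch:>5} rows/commit: {rate:,.0f} XIDs/hour, "
              f"{hours:,.1f} hours to the limit")
```

If the assumption holds, committing row by row leaves only about 10 hours
of headroom, while committing 1000 rows at a time stretches that by a
factor of 1000.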


> One thing you could consider is running the long-running queries on a
> standby server.  Replication will get delayed, and you have to keep all
> the WAL around for the standby to catch up once the query is done, but
> it should work.
> You'd set "max_standby_streaming_delay" to -1 on the standby.
>
>
We had the "SELECT" query running on a reader instance, but the writer
instance still showed "MaximumUsedTransactionIDs" reaching 1.5 billion.
Does that mean both instances, being part of the same cluster, share the
same XIDs, and that, per your suggestion, we should run such queries on a
separate standby cluster that does not share those XIDs? Or can this be
handled on another reader instance by tweaking some other parameter so
that the instances don't share the same XIDs?

--000000000000e9f8aa06191a99dc--