MIME-Version: 1.0
References: <CAD=mzVXR3GjM0vcthMBwEdbOKqSKcv8oojSS9coczWRi9BRYTA@mail.gmail.com>
In-Reply-To: <CAD=mzVXR3GjM0vcthMBwEdbOKqSKcv8oojSS9coczWRi9BRYTA@mail.gmail.com>
From: Muhammad Salahuddin Manzoor <salahuddin.m@bitnine.net>
Date: Thu, 23 May 2024 08:29:55 +0500
Message-ID: <CAKD7CDk=mB3Z2m9hLK=bX1=KThUwOE+yudmOEaBb6Grqg8HXaQ@mail.gmail.com>
Subject: Re: Long running query causing XID limit breach
To: sud <suds1434@gmail.com>
Cc: pgsql-general <pgsql-general@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000d92fc2061916aaf6"
Archived-At: <https://www.postgresql.org/message-id/CAKD7CDk%3DmB3Z2m9hLK%3DbX1%3DKThUwOE%2ByudmOEaBb6Grqg8HXaQ%40mail.gmail.com>
Precedence: bulk

--000000000000d92fc2061916aaf6
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Greetings,

In a high-transaction environment like yours, autovacuum alone may not
keep up, so it may be necessary to supplement it with manual vacuuming.

A few recommendations:

Monitor long-running queries and try to optimize them.
Tune autovacuum settings.
Consider partitioning large tables.
Schedule manual vacuum runs after peak hours.
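
On the arithmetic in question 1 below, a rough sketch may help. As far as
I know, PostgreSQL assigns one XID per writing transaction (subtransactions
that write also take one), not one per row, so the ~500 million
transactions/day figure matters more than the ~5 billion rows/day figure.
The Python below is only a back-of-envelope estimate: the name
hours_until_limit and the constant-rate model are my own assumptions, and
it assumes one XID per business transaction, which may not match your
workload.

```python
# Back-of-envelope estimate of how long a held-back xmin horizon can last
# before XID age approaches the ~2 billion limit.
# Assumption (hypothetical model): constant XID consumption, one XID per
# writing transaction rather than one per row.

XID_LIMIT = 2**31  # (2^32)/2, the ~2 billion figure from the question


def hours_until_limit(xids_per_day: float, current_age: int = 0) -> float:
    """Hours until XID age reaches XID_LIMIT at a constant daily rate."""
    if xids_per_day <= 0:
        raise ValueError("xids_per_day must be positive")
    remaining = max(XID_LIMIT - current_age, 0)
    return remaining / (xids_per_day / 24)


# Figures taken from the question: ~500 million transactions/day,
# compared with the per-row reading of ~5 billion/day.
per_transaction = hours_until_limit(500_000_000)
per_row = hours_until_limit(5_000_000_000)

print(f"one XID per transaction: ~{per_transaction:.0f} hours")
print(f"one XID per row (pessimistic): ~{per_row:.0f} hours")
```

With the figures from the question, the per-transaction estimate works out
to roughly four days, against about ten hours for the per-row reading.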

Salahuddin (살라후딘)


On Thu, 23 May 2024 at 02:16, sud <suds1434@gmail.com> wrote:

> Hello,
> It's RDS postgres version 15.4. We suddenly saw the
> "MaximumUsedTransactionIDs" reach ~1.5 billion and got alerted by team
> members who mentioned the database is going to shut down or hang if this
> value reaches ~2 billion and won't be able to serve any incoming
> transactions. It was a panic situation.
>
> I have heard of it before: because of the way postgres works, the XID,
> being a 32-bit integer, can only represent (2^32)/2 = ~2 billion
> transactions. However, as RDS performs the auto vacuum, we thought
> that we need not worry about this issue. But it seems we were wrong. We
> found one ad hoc "SELECT" query that had been running on the reader
> instance for the last couple of days, and when that was killed, the max
> XID (MaximumUsedTransactionIDs) dropped to 50 million immediately.
>
> So I have a few questions:
>
> 1) This system is going to be a 24/7 system which will process
> ~500 million business transactions/day in future, i.e. ~4-5 billion
> rows/day inserted across multiple tables. As I understand, each row
> will have an XID allocated. So in that case, does it mean that we will
> need (5 billion/24) = ~200 million XIDs/hour, and thus, if any such
> legitimate application "SELECT" query keeps running for ~10 hours (and
> thus keeps the historical XID alive), it can saturate
> "MaximumUsedTransactionIDs" and bring the database to a standstill in
> 2 billion/200 million = ~10 hrs? Is this understanding correct? It seems
> we are prone to hit this limit sooner going forward.
>
> 2) We have some legitimate cases where the reporting queries can run for
> 5-6 hrs. In such cases, if this SELECT query starts at the 100th XID on
> table TAB1, then whatever transactions happen after that time, across
> all other tables (table2, table3 etc.) in the database, won't get
> vacuumed until that SELECT query on table1 finishes (as the database
> will try to keep that same 100th XID image), and the XID will just keep
> incrementing for new transactions, eventually reaching the max limit.
> Is my understanding correct here?
>
> 3) Although RDS does the auto vacuum by default, should we also
> consider doing manual vacuum without impacting ongoing transactions?
> Something like the options below:
> vacuum freeze tab1;
> vacuum freeze;
> vacuum;
> vacuum analyze tab1;
> vacuum tab1;
>
> 4) I worked in the past with Oracle Database, where the similar
> transaction identifier is called the "system change number", but never
> encountered it being exhausted. There it also had UNDO records, and if a
> SELECT query needed anything beyond a certain limit (set by the
> undo_retention parameter), the query would fail with a "snapshot too
> old" error without impacting any write transactions. But in postgres it
> seems nothing like that happens, and every "SELECT" query will try to
> run to completion without any such failure, until it gets killed by
> someone. Is my understanding correct?
>
> And in that case, it seems we have to mandatorily set
> "statement_timeout" to some value, e.g. 4 hrs (also I am not seeing a
> way to set it for any specific user, so it will be set for all queries
> including application-level ones), and also
> "idle_in_transaction_session_timeout" to 5 minutes, on all the prod and
> non-prod databases, to restrict long-running transactions/queries and
> avoid such issues in future. Correct me if I'm wrong.
>
> Regards
> Sud
>

--000000000000d92fc2061916aaf6--