MIME-Version: 1.0
From: sud <suds1434@gmail.com>
Date: Thu, 23 May 2024 02:46:31 +0530
Message-ID: <CAD=mzVXR3GjM0vcthMBwEdbOKqSKcv8oojSS9coczWRi9BRYTA@mail.gmail.com>
Subject: Long running query causing XID limit breach
To: pgsql-general <pgsql-general@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="00000000000084b0b006191173ed"
Archived-At: <https://www.postgresql.org/message-id/CAD%3DmzVXR3GjM0vcthMBwEdbOKqSKcv8oojSS9coczWRi9BRYTA%40mail.gmail.com>
Precedence: bulk

--00000000000084b0b006191173ed
Content-Type: text/plain; charset="UTF-8"

Hello,
We are running RDS postgres version 15.4. We suddenly saw the
"MaximumUsedTransactionIDs" metric reach ~1.5 billion, and team members
warned us that the database would shut down or hang, and stop serving
incoming transactions, if this value reached ~2 billion. It was a panic
situation.

I had heard of this before: because the XID is a 32-bit integer, postgres
can only represent (2^32)/2 = ~2 billion transactions. However, as RDS
performs autovacuum, we thought we need not worry about this issue. It seems
we were wrong. We found that one ad hoc "SELECT" query had been running on
the reader instance for the last couple of days, and when it was killed, the
max XID age (MaximumUsedTransactionIDs) dropped to 50 million immediately.

So I have a few questions:

1) This system is going to run 24/7 and will process ~500 million business
transactions/day in future, i.e. ~4-5 billion rows/day inserted across
multiple tables. As I understand it, each row will have an XID allocated.
Does that mean we will need (5 billion/24) = ~200 million XIDs/hour, so that
if any legitimate application "SELECT" query keeps running for ~10 hours
(and thus keeps the historical XID alive), it can saturate
"MaximumUsedTransactionIDs" and bring the database to a standstill in
2 billion/200 million = ~10 hrs? Is this understanding correct? It seems we
are prone to hit this limit sooner going forward.

2) We have some legitimate cases where reporting queries can run for
5-6 hrs. In such cases, if the SELECT query on table TAB1 starts at the
100th XID, then whatever transactions happen after that time, across all
other tables (table2, table3, etc.) in the database, won't get vacuumed
until that SELECT query on TAB1 finishes (as the database will try to keep
that same 100th XID image), and the XID will just keep incrementing for new
transactions, eventually reaching the max limit. Is my understanding correct
here?

3) RDS runs autovacuum by default, but should we also consider running
manual vacuum without impacting ongoing transactions? Something like the
options below:
vacuum freeze tab1;
vacuum freeze;
vacuum;
vacuum analyze tab1;
vacuum tab1;

4) I have worked in the past with Oracle, where the similar transaction
identifier is called the "system change number", but I never encountered it
being exhausted. Oracle also keeps UNDO records, and if a SELECT query needs
anything beyond a certain limit (set by the undo_retention parameter), the
query fails with a "snapshot too old" error without impacting any write
transactions. In postgres it seems nothing like that happens, and every
"SELECT" query will try to run to completion without any such failure until
it gets killed by someone. Is my understanding correct?

And in that case, it seems we have to set "statement_timeout" to some value,
e.g. 4 hrs (I am also not seeing a way to set it for a specific user, so it
will apply to all queries, including application ones), and also set
"idle_in_transaction_session_timeout" to 5 minutes, on all the prod and
non-prod databases, to restrict long-running transactions/queries and avoid
such issues in future. Correct me if I'm wrong.
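
To make the arithmetic in question 1 concrete, here is a rough Python sketch.
It assumes the limit is 2^31 XIDs and uses only the daily volumes quoted
above; the function name hours_until_limit and the variable names are made
up for illustration. Case A follows the per-row assumption in question 1.
Case B assumes instead that an XID is consumed once per writing transaction,
and that each business transaction maps to a single database transaction;
both of those are assumptions here, not measurements.

```python
# Back-of-the-envelope XID budget. All rates are the figures quoted in
# this message; nothing here is measured on the real system.

XID_LIMIT = 2**31  # ~2.1 billion: half of the 32-bit XID space


def hours_until_limit(xids_per_day: float, xid_limit: int = XID_LIMIT) -> float:
    """Hours it takes to consume xid_limit XIDs at a steady daily rate."""
    xids_per_hour = xids_per_day / 24
    return xid_limit / xids_per_hour


# Case A (the assumption in question 1): one XID per inserted row.
rows_per_day = 5_000_000_000
hours_per_row = hours_until_limit(rows_per_day)

# Case B (assumption): one XID per business transaction,
# however many rows that transaction inserts.
transactions_per_day = 500_000_000
hours_per_txn = hours_until_limit(transactions_per_day)

print(f"one XID per row:         ~{hours_per_row:.1f} hours")
print(f"one XID per transaction: ~{hours_per_txn:.1f} hours")
```

With these inputs the sketch prints ~10.3 hours for case A and ~103.1 hours
for case B, so which of the two assumptions holds changes the margin by a
factor of ten.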

Regards
Sud

--00000000000084b0b006191173ed--