MIME-Version: 1.0
References: <CAD=mzVXR3GjM0vcthMBwEdbOKqSKcv8oojSS9coczWRi9BRYTA@mail.gmail.com>
In-Reply-To: <CAD=mzVXR3GjM0vcthMBwEdbOKqSKcv8oojSS9coczWRi9BRYTA@mail.gmail.com>
From: David HJ <chuxiongzhong@gmail.com>
Date: Sun, 26 May 2024 13:56:16 +0800
Message-ID: <CAKabb9XsSKEzmYV+WKPptLFPVYbqrD_W8UJKiQqW5euyS2HZoQ@mail.gmail.com>
Subject: Re: Long running query causing XID limit breach
To: sud <suds1434@gmail.com>
Cc: pgsql-general <pgsql-general@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000ecfb6a0619551aca"
Archived-At: <https://www.postgresql.org/message-id/CAKabb9XsSKEzmYV%2BWKPptLFPVYbqrD_W8UJKiQqW5euyS2HZoQ%40mail.gmail.com>
Precedence: bulk

--000000000000ecfb6a0619551aca
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Does anyone know how to unsubscribe from this mailing list?

On Thu, May 23, 2024 at 5:16 AM sud <suds1434@gmail.com> wrote:

> Hello,
> It's RDS Postgres version 15.4. We suddenly saw
> "MaximumUsedTransactionIDs" reach ~1.5 billion and were alerted by team
> members, who said the database would shut down or hang if this
> value reached ~2 billion and would no longer serve incoming
> transactions. It was a panic situation.
>
> I had heard of this before: because of the way Postgres works, the XID,
> being a 32-bit integer, can represent only (2^32)/2 = ~2 billion
> transactions. However, as RDS performs autovacuum, we thought
> that we need not worry about this issue. It seems we were wrong. We
> found one ad hoc "SELECT" query that had been running on the reader
> instance for the last couple of days, and when it was killed, the max
> XID (MaximumUsedTransactionIDs) dropped to 50 million immediately.
>
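
For watching this directly, here is a minimal sketch, assuming a role
that may read other sessions in pg_stat_activity (on RDS that usually
means rds_superuser or pg_monitor membership). It shows the XID age of
each database and the sessions whose snapshots hold cleanup back. The
LIMIT and the 60-character query preview are arbitrary choices.

```sql
-- Age of the oldest unfrozen XID in each database.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- Sessions holding the oldest snapshots on this instance.
SELECT pid, usename, state, xact_start,
       age(backend_xmin) AS xmin_age,
       left(query, 60) AS query_preview
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;
```

Run the second query on the instance where the long session lives; a
query running on a reader does not appear in pg_stat_activity on the
writer.
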
> So I have a few questions:
>
> 1) This system will run 24/7 and will process ~500 million business
> transactions/day in the future, i.e. ~4-5 billion rows/day inserted
> across multiple tables. As I understand it, each row will have an XID
> allocated. In that case, does it mean that we will need
> (5 billion/24) = ~200 million XIDs/hour, and thus, if any such
> legitimate application "SELECT" query keeps running for ~10 hours (and
> so keeps the historical XID alive), it can saturate
> "MaximumUsedTransactionIDs" and bring the database to a standstill in
> 2 billion/200 million = ~10 hrs? Is this understanding correct? It
> seems we are prone to hitting this limit sooner going forward.
>
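
On the arithmetic in 1): as far as I know, XIDs are assigned per
writing transaction (and per subtransaction), not per row, so the row
count is not the driver. A rough sketch, assuming one XID per business
transaction and the 500 million/day figure quoted above:

```sql
-- Assumption: one XID per business transaction, none per row.
SELECT round(500e6 / 24) AS xids_per_hour,
       round(2e9 / (500e6 / 24)) AS hours_to_2_billion;
```

That works out to roughly 21 million XIDs per hour and about 96 hours
of headroom rather than 10. Savepoints and PL/pgSQL exception blocks
consume extra XIDs, so treat this as an upper bound on the headroom.
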
> 2) We have some legitimate cases where reporting queries can run for
> 5-6 hrs. In such cases, if this SELECT query starts at the 100th
> XID on table TAB1, then whatever transactions happen after that time,
> across all other tables (table2, table3, etc.) in the database, won't
> get vacuumed until that SELECT query on table1 finishes (as the
> database will try to keep that same 100th XID image), and the XID will
> just keep incrementing for new transactions, eventually reaching the
> max limit. Is my understanding correct here?
>
> 3) Although RDS runs autovacuum by default, should we also
> consider running manual vacuum without impacting ongoing transactions?
> Something like the options below:
> vacuum freeze tab1;
> vacuum freeze;
> vacuum;
> vacuum analyze tab1;
> vacuum tab1;
>
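
For 3), a sketch of how one might pick targets before vacuuming by
hand, assuming tab1 from the list above turns out to be among the
oldest tables. Note that no vacuum, manual or automatic, can advance
relfrozenxid past the oldest snapshot that is still open, so this does
not help while a long-running query is holding the horizon back.

```sql
-- Tables with the oldest unfrozen XIDs in the current database.
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;

-- Freeze one table; plain VACUUM does not block reads or writes.
VACUUM (FREEZE, VERBOSE) tab1;
```
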
> 4) I worked with Oracle Database in the past, where the similar
> transaction identifier is called the "system change number", but I
> never saw it exhausted. Oracle also had UNDO records, and if a SELECT
> query needed anything beyond a certain limit (set by the undo_retention
> parameter), the query would fail with a "snapshot too old" error
> without impacting any write transactions. But in Postgres it seems
> nothing like that happens, and every "SELECT" query will try to run to
> completion without any such failure, until it gets killed by someone.
> Is my understanding correct?
>
> And in that case, it seems we must set "statement_timeout" to some
> value, e.g. 4 hrs (also, I am not seeing a way to set it at a specific
> user level, so it will apply to all queries, including
> application-level ones), and also set
> "idle_in_transaction_session_timeout" to 5 minutes, on all the prod
> and non-prod databases, to restrict long-running
> transactions/queries and avoid such issues in the future. Correct me
> if I'm wrong.
>
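
On per-user timeouts: both settings can be attached to a role with
ALTER ROLE ... SET, so they need not apply to every session. A minimal
sketch; reporting_user and app_user are hypothetical role names, and
the values are the ones proposed above.

```sql
-- Hypothetical roles; substitute the real reporting and app roles.
ALTER ROLE reporting_user SET statement_timeout TO '4h';
ALTER ROLE app_user
  SET idle_in_transaction_session_timeout TO '5min';

-- Check what is stored; new sessions pick the settings up at login.
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolname IN ('reporting_user', 'app_user');
```
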
> Regards
> Sud
>


--000000000000ecfb6a0619551aca--