Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s9z9h-006mi9-Br for pgsql-general@arkaria.postgresql.org; Thu, 23 May 2024 03:30:14 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1s9z9h-007Ccd-Et for pgsql-general@arkaria.postgresql.org; Thu, 23 May 2024 03:30:13 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s9z9g-007CcN-Ty for pgsql-general@lists.postgresql.org; Thu, 23 May 2024 03:30:13 +0000 Received: from mail-lj1-x22b.google.com ([2a00:1450:4864:20::22b]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1s9z9c-000JWT-Pw for pgsql-general@lists.postgresql.org; Thu, 23 May 2024 03:30:12 +0000 Received: by mail-lj1-x22b.google.com with SMTP id 38308e7fff4ca-2e271acb015so93872111fa.1 for ; Wed, 22 May 2024 20:30:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bitnine-net.20230601.gappssmtp.com; s=20230601; t=1716435007; x=1717039807; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=qgVGvol8Khm2unCzi1YFOvjr6OzQ/TuDPQTsH+5hFK0=; b=FGdG23Ek/7KQlPoU0Z2p+oo/5KgJW/LDmSiXD1A6cY4U1yh5W2TER7PYIxDBXobVUW QSUFF7TPMelT/ZgbLJgB42XEPpGsDRmUcmQeRxoc0Y0UWgZW3EW9A5fN/rLaledp+ld5 ven8lxg1hxDSFOw537w0zAM2yQOf8Uei9+K19a3N2HQHZZWy12ye+Ffhg4pkk4JyPqnO T0Pxt0qT1mGF3CJwqkZUNcoIqkS/LqgA62p3mQCO7zNxHl8sH6+jzqYeMs4xZgTq5nfw MEJoVF1NrNlyqHUgrQQVrA1oU29/bsIvmE5Xh430D7ECywAzKnSWFR7hGH+X42/Gomj6 ReHg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1716435007; x=1717039807; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=qgVGvol8Khm2unCzi1YFOvjr6OzQ/TuDPQTsH+5hFK0=; b=Zagftq8ZCClGM/5+iBcmQCxUF2iNZPD3sbd2cj0/l//SWrs1TXKlSpktNsPeFZ2mxx ils9Blp2A2IYPurgiD9aLxuczJcNdBUrl9RG7DejViyJntwElpkcBIplUd5RhuxqND7U 5WeWyJj6nV1UnNY745YEPwo0A19em5DgBAqM9vwoIcQaxBky/LlFaLbKwRsUj9zpwaKO NDI8GHCF94Ocg3dHYy1kmCyvg4xbWWQ/3LIuWJesy1XrLslX8bH7vnh73ISoZXnccu0D hv9ipa2XuSggoStFFkGDtmRWGiilAswdj41URLMGjvEIwkRYx6+oSufLhPQigRON8cXT 9lgg== X-Gm-Message-State: AOJu0YwXUw66Vfn2DDVBiBntE8JmMzBPOw8fbWQNcvN/bmvueOL5qGR5 aTWhS/GcHgL7fZfIrq9zlbbRxib2ogod3ptmkdKQkrRp2KkDFJtPO+PCYTj7vqM4+SGNDGlb8H5 CrTkZVWmj29zg7ZWTby9Oqh9AmC0n27LT/YEvwg== X-Google-Smtp-Source: AGHT+IHRB8pXuF50e8ZV4ff9F7rO2eZc6uChcVIUkzYUQO+jMIfL4lt7PHUphs5gDDhU+BtfaGKvuD9sK1ZqgqGfyRU= X-Received: by 2002:ac2:4e8e:0:b0:51f:3fea:cbce with SMTP id 2adb3069b0e04-526c07837cemr1724901e87.51.1716435006591; Wed, 22 May 2024 20:30:06 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Muhammad Salahuddin Manzoor Date: Thu, 23 May 2024 08:29:55 +0500 Message-ID: Subject: Re: Long running query causing XID limit breach To: sud Cc: pgsql-general Content-Type: multipart/alternative; boundary="000000000000d92fc2061916aaf6" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000d92fc2061916aaf6 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Greetings, In high-transaction environments like yours, it may be necessary to supplement this with manual vacuuming. Few Recommendations Monitor Long-Running Queries try to optimize. Optimize Autovacuum. Partitioning. Adopt Vacuum Strategy after peak hours. *Salahuddin (=EC=82=B4=EB=9D=BC=ED=9B=84=EB=94=98**)* On Thu, 23 May 2024 at 02:16, sud wrote: > Hello , > It's RDS postgres version 15.4. We suddenly saw the > "MaximumUsedTransactionIDs" reach to ~1.5billion and got alerted by team > members who mentioned the database is going to be in shutdown/hung if thi= s > value reaches to ~2billion and won't be able to serve any incoming > transactions. It was a panic situation. > > I have heard of it before , because of the way postgres works and the XID > being a datatype of length 32 bit integer can only represent (2^32)/2=3D~= 2 > billion transactions. However, as RDS performs the auto vacuum , we thoug= ht > that we need not worry about this issue. But it seems we were wrong. And = we > found one adhoc "SELECT '' query was running on the reader instance since > the last couple of days and when that was killed, the max xid > (MaximumUsedTransactionIDs) dropped to 50million immediately. > > So I have few questions, > > 1)This system is going to be a 24/7 up and running system which will > process ~500million business transactions/day in future i.e. ~4-5billion > rows/day inserted across multiple tables each day. And as I understand ea= ch > row will have XID allocated. So in that case , does it mean that, we will > need (5billion/24)=3D~200million XID/hour and thus , if any such legitima= te > application "SELECT" query keeps running for ~10 hours (and thus keep the > historical XID alive) , then it can saturate the > "MaximumUsedTransactionIDs" and make the database standstill in > 2billion/200million=3D~10hrs. Is this understanding correct? Seems we are > prone to hit this limit sooner going forward. > > 2)We have some legitimate cases where the reporting queries can run for > 5-6hrs. So in such cases if the start of this SELECT query happen at 100t= h > XID on table TAB1, then whatever transactions happen after that time, > across all other tables(table2, table3 etc) in the database won't get > vacuum until that SELECT query on table1 get vacuumed(as database will tr= y > to keep that same 100th XID image) and the XID will just keep incrementin= g > for new transaction, eventually reaching the max limit. Is my understandi= ng > correct here? > > 3)Although RDS does the auto vacuum by default. but should we also > consider doing manual vacuum without impacting ongoing transactions? > Something as below options > vacuum freeze tab1; > vacuum freeze; > vacuum; > vacuum analyze tab1; > vacuum tab1; > > 4)Had worked in past in oracle database where the similar transaction > identifier is called as "system change number" , but never encountered th= at > being exhausted and also there it used to have UNDO record and if a SELEC= T > query needs anything beyond certain limit(set undo_retention parameter) t= he > select query used to fail with snapshot too old error but not impacting a= ny > write transactions. But in postgres it seems nothing like that happens an= d > every "Select query" will try to run till its completion without any such > failure, until it gets skilled by someone. Is my understanding correct? > > And in that case, It seems we have to mandatorily set "statement_timeout= " > to some value e.g. 4hrs(also i am not seeing a way to set it for any > specific user level, so it will be set for all queries including > application level) and also "idle_in_transaction_session_timeout" to > 5minutes, even on all the prod and non prod databases, to restrict the lo= ng > running transactions/queries and avoid such issues in future. Correct me = if > I'm wrong. > > Regards > Sud > --000000000000d92fc2061916aaf6 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Greetings,

In high-transaction envi= ronments like yours, it may be necessary to supplement this with manual vac= uuming.

Few Recommendations

Monitor Long-Running Queries try = to optimize.
Optimize Autovacuum.
Partitioning.
Adopt Vacuum Strat= egy after peak hours.

Salahuddin (=EC=82=B4=EB=9D=BC=ED=9B=84=EB=94=98)



On Thu, 23 May 2024 at 02:16, sud <suds1434@gmail.com> wrote:
Hello ,=C2=A0
It's RDS po= stgres version 15.4. We suddenly saw the "MaximumUsedTransactionIDs&qu= ot; reach to ~1.5billion and got alerted by team members who mentioned the = database is going to be in shutdown/hung if this value reaches to ~2billion= and won't be able to serve any incoming transactions. It was a panic s= ituation.

I have heard of it before , because of the way postgres wo= rks and the XID being a datatype of length 32 bit integer can only represen= t (2^32)/2=3D~2 billion transactions. However, as RDS performs the auto vac= uum , we thought that we need not worry about this issue. But it seems we w= ere wrong. And we found one adhoc "SELECT '' query was running= on the reader instance since the last couple of days and when that was kil= led, the max xid (MaximumUsedTransactionIDs) dropped to 50million immediate= ly.=C2=A0

So I have few questions,

1)This system is goin= g to be a 24/7 up and running system which will process ~500million busines= s transactions/day in future i.e. ~4-5billion rows/day inserted across mult= iple tables each day. And as I understand each row will have XID allocated.= So in that case , does it mean that, we will need (5billion/24)=3D~200mill= ion XID/hour and thus , if any such legitimate application "SELECT&quo= t; query keeps running for ~10 hours (and thus keep the historical XID aliv= e) , then it can saturate the "MaximumUsedTransactionIDs" and mak= e the database standstill in 2billion/200million=3D~10hrs. Is this understa= nding correct? Seems we are prone to hit this limit sooner going forward.
2)We have some legitimate cases where the reporting queries can run f= or 5-6hrs. So in such cases if the start of this SELECT query happen at 100= th XID on table TAB1, then whatever transactions happen after that time, ac= ross all other tables(table2, table3 etc) in the database won't get vac= uum until that SELECT query on table1 get vacuumed(as database will try to = keep that same 100th XID image) and the XID will just keep incrementing for= new transaction, eventually reaching the max limit. Is my understanding co= rrect here?

3)Although RDS does the auto vacuum by default. but shou= ld we also consider doing manual vacuum without impacting ongoing transacti= ons? Something as below options
vacuum freeze tab1;
vacuum freeze; vacuum;
vacuum analyze tab1;
vacuum tab1;

4)Had worked in pa= st in oracle database where the similar transaction identifier is called as= "system change number" , but never encountered that being exhaus= ted and also there it used to have UNDO record and if a SELECT query needs = anything beyond certain limit(set undo_retention=C2=A0parameter) the select= query used to fail with snapshot too old error but not impacting any write= transactions. But in postgres it seems nothing like that happens and every= "Select query" will try to run till its completion without any s= uch failure,=C2=A0until it gets skilled by someone. Is my understanding cor= rect?

=C2=A0And in that case, It seems we have to = mandatorily set "statement_timeout" to some=C2=A0value e.g. 4hrs(= also i am not seeing a way to set it for any specific user level, so it wil= l be set for all queries including application level) and also "idle_i= n_transaction_session_timeout" to 5minutes, even on all the prod and n= on prod databases, to restrict the long running transactions/queries and av= oid such issues in future. Correct me if I'm wrong.
=C2= =A0
Regards
Sud
--000000000000d92fc2061916aaf6--