Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s9tKJ-005oiH-TR for pgsql-general@arkaria.postgresql.org; Wed, 22 May 2024 21:16:49 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1s9tKJ-003qQt-Pu for pgsql-general@arkaria.postgresql.org; Wed, 22 May 2024 21:16:47 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s9tKJ-003qQk-8b for pgsql-general@lists.postgresql.org; Wed, 22 May 2024 21:16:47 +0000 Received: from mail-vk1-xa2e.google.com ([2607:f8b0:4864:20::a2e]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1s9tKG-001VsM-Gv for pgsql-general@lists.postgresql.org; Wed, 22 May 2024 21:16:46 +0000 Received: by mail-vk1-xa2e.google.com with SMTP id 71dfb90a1353d-4df4016b3c9so1899536e0c.1 for ; Wed, 22 May 2024 14:16:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1716412603; x=1717017403; darn=lists.postgresql.org; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=1r259J3bEYoFTzGIJRSUpseOYaKzAvu0RCY+BcxkRKM=; b=BEyuIQ/iJv6KO7tz/tRfq6LbPS735aPBJM8QLZhnDr1POmpWhvlbixHO+185yK8rJF h1/8GrMWZSzrnzSSjteqF8twGWAnYe4uhryp15POU+VyHBUALxgyow9LRVCKTZQI+iNS NAQDcz3X5iP9G0wXo64Qj34V2JWsABu3yqpljMt1x7EIcIQxnm4gWlPLK+3gtOgrtLby Zhxp1p9ryIorUDy5x3OFLBV7XPWwRnO5mA836/K20KlaNWJ9u7qdSKGGn1MmyWTaEj7e Gk2jDXIdnq4xtNXrjkK//8FRa0vGFrheXY3/6bRLVxnOBiEvIiPR2ELrpKjSbjgpLxR5 zhCA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1716412603; x=1717017403; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=1r259J3bEYoFTzGIJRSUpseOYaKzAvu0RCY+BcxkRKM=; b=B7oFNJzSt/38+WQzDH3bljK0n+LcC7LfigZPXmU4GZ0BVPzLuicFhhsEGZF5OrBipx 3b+68G1W9/SFh8OO3/e6kEwf9ypaisVR2K3xVJFJnCV49jjNcGmhtmY1BeCNbsS4hEzt SJ4IZPBbr5LSrmcJsh1ZxvZiqpIT9AIRbAr8lhfP4NpMv3EjkoD4H7HDpAylKHH5xKvl /4t6SEFgHM6QoCiqntZ6DvDewtEWVMhYzsGDIaaSp9Zt0COOEutnSPiOxOBaFpv6K449 KVSDYeSxl2zpNXCxF6SRQM3SBUrg3Z4ljgLVDpd32pz9/OxFoRxcyufm9zPtUZL1vCZQ CdMw== X-Gm-Message-State: AOJu0YyGauJ+iDy1NZ+L/7uCi/+IjuZ5gEcw3DxEyqQ8nhtnArD9zd56 iTL6v47nOGRON3SIKjBaHcJbG4eITH4oS7xl+NxkaeW42i/AZJjj8NTGtaBvJQh6E46GP4TIK/1 27Sz/8fE5Yp1BOgEmCX0zZp1N+6lRUHMo X-Google-Smtp-Source: AGHT+IGsNSDXsSrQqhEjgmuc5MUYUToc8cbBvOt2aFzFrlpAeyXTbZUUf729em9dI0BnVg4rHZwW0Lszd9KL7hE+NSI= X-Received: by 2002:a05:6122:7d2:b0:4d3:45a2:ae4f with SMTP id 71dfb90a1353d-4e21860fc21mr3188199e0c.14.1716412603476; Wed, 22 May 2024 14:16:43 -0700 (PDT) MIME-Version: 1.0 From: sud Date: Thu, 23 May 2024 02:46:31 +0530 Message-ID: Subject: Long running query causing XID limit breach To: pgsql-general Content-Type: multipart/alternative; boundary="00000000000084b0b006191173ed" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000084b0b006191173ed Content-Type: text/plain; charset="UTF-8" Hello , It's RDS postgres version 15.4. We suddenly saw the "MaximumUsedTransactionIDs" reach to ~1.5billion and got alerted by team members who mentioned the database is going to be in shutdown/hung if this value reaches to ~2billion and won't be able to serve any incoming transactions. It was a panic situation. I have heard of it before , because of the way postgres works and the XID being a datatype of length 32 bit integer can only represent (2^32)/2=~2 billion transactions. However, as RDS performs the auto vacuum , we thought that we need not worry about this issue. But it seems we were wrong. And we found one adhoc "SELECT '' query was running on the reader instance since the last couple of days and when that was killed, the max xid (MaximumUsedTransactionIDs) dropped to 50million immediately. So I have few questions, 1)This system is going to be a 24/7 up and running system which will process ~500million business transactions/day in future i.e. ~4-5billion rows/day inserted across multiple tables each day. And as I understand each row will have XID allocated. So in that case , does it mean that, we will need (5billion/24)=~200million XID/hour and thus , if any such legitimate application "SELECT" query keeps running for ~10 hours (and thus keep the historical XID alive) , then it can saturate the "MaximumUsedTransactionIDs" and make the database standstill in 2billion/200million=~10hrs. Is this understanding correct? Seems we are prone to hit this limit sooner going forward. 2)We have some legitimate cases where the reporting queries can run for 5-6hrs. So in such cases if the start of this SELECT query happen at 100th XID on table TAB1, then whatever transactions happen after that time, across all other tables(table2, table3 etc) in the database won't get vacuum until that SELECT query on table1 get vacuumed(as database will try to keep that same 100th XID image) and the XID will just keep incrementing for new transaction, eventually reaching the max limit. Is my understanding correct here? 3)Although RDS does the auto vacuum by default. but should we also consider doing manual vacuum without impacting ongoing transactions? Something as below options vacuum freeze tab1; vacuum freeze; vacuum; vacuum analyze tab1; vacuum tab1; 4)Had worked in past in oracle database where the similar transaction identifier is called as "system change number" , but never encountered that being exhausted and also there it used to have UNDO record and if a SELECT query needs anything beyond certain limit(set undo_retention parameter) the select query used to fail with snapshot too old error but not impacting any write transactions. But in postgres it seems nothing like that happens and every "Select query" will try to run till its completion without any such failure, until it gets skilled by someone. Is my understanding correct? And in that case, It seems we have to mandatorily set "statement_timeout" to some value e.g. 4hrs(also i am not seeing a way to set it for any specific user level, so it will be set for all queries including application level) and also "idle_in_transaction_session_timeout" to 5minutes, even on all the prod and non prod databases, to restrict the long running transactions/queries and avoid such issues in future. Correct me if I'm wrong. Regards Sud --00000000000084b0b006191173ed Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hello ,=C2=A0
It's RDS postgres version 15.4. We su= ddenly saw the "MaximumUsedTransactionIDs" reach to ~1.5billion a= nd got alerted by team members who mentioned the database is going to be in= shutdown/hung if this value reaches to ~2billion and won't be able to = serve any incoming transactions. It was a panic situation.

I have he= ard of it before , because of the way postgres works and the XID being a da= tatype of length 32 bit integer can only represent (2^32)/2=3D~2 billion tr= ansactions. However, as RDS performs the auto vacuum , we thought that we n= eed not worry about this issue. But it seems we were wrong. And we found on= e adhoc "SELECT '' query was running on the reader instance si= nce the last couple of days and when that was killed, the max xid (MaximumU= sedTransactionIDs) dropped to 50million immediately.=C2=A0

So I= have few questions,

1)This system is going to be a 24/7 up and runn= ing system which will process ~500million business transactions/day in futu= re i.e. ~4-5billion rows/day inserted across multiple tables each day. And = as I understand each row will have XID allocated. So in that case , does it= mean that, we will need (5billion/24)=3D~200million XID/hour and thus , if= any such legitimate application "SELECT" query keeps running for= ~10 hours (and thus keep the historical XID alive) , then it can saturate = the "MaximumUsedTransactionIDs" and make the database standstill = in 2billion/200million=3D~10hrs. Is this understanding correct? Seems we ar= e prone to hit this limit sooner going forward.

2)We have some legit= imate cases where the reporting queries can run for 5-6hrs. So in such case= s if the start of this SELECT query happen at 100th XID on table TAB1, then= whatever transactions happen after that time, across all other tables(tabl= e2, table3 etc) in the database won't get vacuum until that SELECT quer= y on table1 get vacuumed(as database will try to keep that same 100th XID i= mage) and the XID will just keep incrementing for new transaction, eventual= ly reaching the max limit. Is my understanding correct here?

3)Altho= ugh RDS does the auto vacuum by default. but should we also consider doing = manual vacuum without impacting ongoing transactions? Something as below op= tions
vacuum freeze tab1;
vacuum freeze;
vacuum;
vacuum analyz= e tab1;
vacuum tab1;

4)Had worked in past in oracle database wher= e the similar transaction identifier is called as "system change numbe= r" , but never encountered that being exhausted and also there it used= to have UNDO record and if a SELECT query needs anything beyond certain li= mit(set undo_retention=C2=A0parameter) the select query used to fail with s= napshot too old error but not impacting any write transactions. But in post= gres it seems nothing like that happens and every "Select query" = will try to run till its completion without any such failure,=C2=A0until it= gets skilled by someone. Is my understanding correct?

=
=C2=A0And in that case, It seems we have to mandatorily set "stat= ement_timeout" to some=C2=A0value e.g. 4hrs(also i am not seeing a way= to set it for any specific user level, so it will be set for all queries i= ncluding application level) and also "idle_in_transaction_session_time= out" to 5minutes, even on all the prod and non prod databases, to rest= rict the long running transactions/queries and avoid such issues in future.= Correct me if I'm wrong.
=C2=A0
Regards
Sud
--00000000000084b0b006191173ed--