Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sB6v1-00EL1X-1b for pgsql-general@arkaria.postgresql.org; Sun, 26 May 2024 05:59:44 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sB6uz-003Wb6-EG for pgsql-general@arkaria.postgresql.org; Sun, 26 May 2024 05:59:41 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sB6uy-003Way-VV for pgsql-general@lists.postgresql.org; Sun, 26 May 2024 05:59:41 +0000 Received: from mail-ej1-x633.google.com ([2a00:1450:4864:20::633]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sB6uu-000oHj-MK for pgsql-general@lists.postgresql.org; Sun, 26 May 2024 05:59:40 +0000 Received: by mail-ej1-x633.google.com with SMTP id a640c23a62f3a-a62c17cf527so10120166b.2 for ; Sat, 25 May 2024 22:59:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1716703175; x=1717307975; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=w7788Ptgj/zBGnZX4XppqYaOwAn+4xlUjI7rApXl/ds=; b=gYS+bc6i67Wd3wfEQbOt7WXliHXlt7JVIxMozgK2oCZKgbGBIXmm1Anoee5xCa+D6q SJO9iXLgpKvx9FW8m4wFJjUoAiW7j5D7KRnM/CORFs8uwDMRscwm7IN6RNqioLaDW7Re CG6Au+NEWHfmFm3cC4W9VN95u2A84G/lVhp0f84QlYJrdLht4s9EhoAwlZmB86CvVaVP BkABChrbWHURSqhrpni5948CEwdzyZSsS0ACgDA2owzqFQYXLgDIIVz6Ush/Ljz88Wjr ybHY9Lf/AJGRLbzo3W2uS8B1/c6z/bTDPxRcweeo/hyxgXEyeV/SW5R5O5iAznZm+UVd zEvw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1716703175; x=1717307975; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=w7788Ptgj/zBGnZX4XppqYaOwAn+4xlUjI7rApXl/ds=; b=Yu2MQ3k98rQvGssD9hITxI/sQvsieK7K6nxjWYOlvHc9slXbRcRGbNsvBDnxIUgVLK Tl7j46WFr3d8UPDrcH0oL0vELbMoeKoOCK8Fw7UoZFepn+dAJMCBZ2+4ZQuyzT58V1uU zNWMFw9YcTvbMKFWzaiKf7yqNQ+BQexTN4cdxYOYSw25zV2uzNyzdG/uTQ0xsq66JaFQ DaXjyWRU4RgzVHmsQXh/zUp3oGeflwDU1Xw3ZmFTOtlzChnvbntkIdyuIaavJW4rL2i2 xzxfgWTQnso4BJROmrUWgUgJN++BBIm1/b1yMqu1IJMbetw+q7K5j6Z+1gyYio+9sJ8T Ke/Q== X-Gm-Message-State: AOJu0YyxhFsAvZNDTrFv71bbd/3xFyOO2/ms8jdaKoV3sJdVXDR2aeMZ 4JATgpIqvJo+XWAv4mhY0i/eMp/QuyU3rTdvEBmz3/R5dOjRvoMIDF2n8ZUyuGlfKJmJmMvv8IR S9wGGG8vAUNJgEu/Hjj9hQ/To2ss= X-Google-Smtp-Source: AGHT+IH1dAtcnV8asCUDe3Yb6Rlqmpv1wVzAcShMo/Z9xyQDMKQio5y0lSfIVI4rGSCmQkdmm0rc7+QNGrPRG1EO7qI= X-Received: by 2002:a17:906:b014:b0:a59:a5af:41da with SMTP id a640c23a62f3a-a6264f0ec0amr383585466b.4.1716703174915; Sat, 25 May 2024 22:59:34 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: David HJ Date: Sun, 26 May 2024 13:56:16 +0800 Message-ID: Subject: Re: Long running query causing XID limit breach To: sud Cc: pgsql-general Content-Type: multipart/alternative; boundary="000000000000ecfb6a0619551aca" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000ecfb6a0619551aca Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable anyone know how to describe from this mailing list? On Thu, May 23, 2024 at 5:16=E2=80=AFAM sud wrote: > Hello , > It's RDS postgres version 15.4. We suddenly saw the > "MaximumUsedTransactionIDs" reach to ~1.5billion and got alerted by team > members who mentioned the database is going to be in shutdown/hung if thi= s > value reaches to ~2billion and won't be able to serve any incoming > transactions. It was a panic situation. > > I have heard of it before , because of the way postgres works and the XID > being a datatype of length 32 bit integer can only represent (2^32)/2=3D~= 2 > billion transactions. However, as RDS performs the auto vacuum , we thoug= ht > that we need not worry about this issue. But it seems we were wrong. And = we > found one adhoc "SELECT '' query was running on the reader instance since > the last couple of days and when that was killed, the max xid > (MaximumUsedTransactionIDs) dropped to 50million immediately. > > So I have few questions, > > 1)This system is going to be a 24/7 up and running system which will > process ~500million business transactions/day in future i.e. ~4-5billion > rows/day inserted across multiple tables each day. And as I understand ea= ch > row will have XID allocated. So in that case , does it mean that, we will > need (5billion/24)=3D~200million XID/hour and thus , if any such legitima= te > application "SELECT" query keeps running for ~10 hours (and thus keep the > historical XID alive) , then it can saturate the > "MaximumUsedTransactionIDs" and make the database standstill in > 2billion/200million=3D~10hrs. Is this understanding correct? Seems we are > prone to hit this limit sooner going forward. > > 2)We have some legitimate cases where the reporting queries can run for > 5-6hrs. So in such cases if the start of this SELECT query happen at 100t= h > XID on table TAB1, then whatever transactions happen after that time, > across all other tables(table2, table3 etc) in the database won't get > vacuum until that SELECT query on table1 get vacuumed(as database will tr= y > to keep that same 100th XID image) and the XID will just keep incrementin= g > for new transaction, eventually reaching the max limit. Is my understandi= ng > correct here? > > 3)Although RDS does the auto vacuum by default. but should we also > consider doing manual vacuum without impacting ongoing transactions? > Something as below options > vacuum freeze tab1; > vacuum freeze; > vacuum; > vacuum analyze tab1; > vacuum tab1; > > 4)Had worked in past in oracle database where the similar transaction > identifier is called as "system change number" , but never encountered th= at > being exhausted and also there it used to have UNDO record and if a SELEC= T > query needs anything beyond certain limit(set undo_retention parameter) t= he > select query used to fail with snapshot too old error but not impacting a= ny > write transactions. But in postgres it seems nothing like that happens an= d > every "Select query" will try to run till its completion without any such > failure, until it gets skilled by someone. Is my understanding correct? > > And in that case, It seems we have to mandatorily set "statement_timeout= " > to some value e.g. 4hrs(also i am not seeing a way to set it for any > specific user level, so it will be set for all queries including > application level) and also "idle_in_transaction_session_timeout" to > 5minutes, even on all the prod and non prod databases, to restrict the lo= ng > running transactions/queries and avoid such issues in future. Correct me = if > I'm wrong. > > Regards > Sud > --000000000000ecfb6a0619551aca Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
anyone know how to describe from this mai= ling list?=C2=A0

On Thu, May 23, 2024 at 5:16=E2=80=AFAM sud <suds1434@gmail.com> wrote:
Hello ,=C2=A0=
It's RDS postgres version 15.4. We suddenly saw the "MaximumUs= edTransactionIDs" reach to ~1.5billion and got alerted by team members= who mentioned the database is going to be in shutdown/hung if this value r= eaches to ~2billion and won't be able to serve any incoming transaction= s. It was a panic situation.

I have heard of it before , because of = the way postgres works and the XID being a datatype of length 32 bit intege= r can only represent (2^32)/2=3D~2 billion transactions. However, as RDS pe= rforms the auto vacuum , we thought that we need not worry about this issue= . But it seems we were wrong. And we found one adhoc "SELECT ''= ; query was running on the reader instance since the last couple of days an= d when that was killed, the max xid (MaximumUsedTransactionIDs) dropped to = 50million immediately.=C2=A0

So I have few questions,

1)= This system is going to be a 24/7 up and running system which will process = ~500million business transactions/day in future i.e. ~4-5billion rows/day i= nserted across multiple tables each day. And as I understand each row will = have XID allocated. So in that case , does it mean that, we will need (5bil= lion/24)=3D~200million XID/hour and thus , if any such legitimate applicati= on "SELECT" query keeps running for ~10 hours (and thus keep the = historical XID alive) , then it can saturate the "MaximumUsedTransacti= onIDs" and make the database standstill in 2billion/200million=3D~10hr= s. Is this understanding correct? Seems we are prone to hit this limit soon= er going forward.

2)We have some legitimate cases where the reportin= g queries can run for 5-6hrs. So in such cases if the start of this SELECT = query happen at 100th XID on table TAB1, then whatever transactions happen = after that time, across all other tables(table2, table3 etc) in the databas= e won't get vacuum until that SELECT query on table1 get vacuumed(as da= tabase will try to keep that same 100th XID image) and the XID will just ke= ep incrementing for new transaction, eventually reaching the max limit. Is = my understanding correct here?

3)Although RDS does the auto vacuum b= y default. but should we also consider doing manual vacuum without impactin= g ongoing transactions? Something as below options
vacuum freeze tab1;vacuum freeze;
vacuum;
vacuum analyze tab1;
vacuum tab1;
4)Had worked in past in oracle database where the similar transaction iden= tifier is called as "system change number" , but never encountere= d that being exhausted and also there it used to have UNDO record and if a = SELECT query needs anything beyond certain limit(set undo_retention=C2=A0pa= rameter) the select query used to fail with snapshot too old error but not = impacting any write transactions. But in postgres it seems nothing like tha= t happens and every "Select query" will try to run till its compl= etion without any such failure,=C2=A0until it gets skilled by someone. Is m= y understanding correct?

=C2=A0And in that case, I= t seems we have to mandatorily set "statement_timeout" to some=C2= =A0value e.g. 4hrs(also i am not seeing a way to set it for any specific us= er level, so it will be set for all queries including application level) an= d also "idle_in_transaction_session_timeout" to 5minutes, even on= all the prod and non prod databases, to restrict the long running transact= ions/queries and avoid such issues in future. Correct me if I'm wrong.<= /div>
=C2=A0
Regards
Sud
--000000000000ecfb6a0619551aca--