Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sBHzG-00Fc34-49 for pgsql-general@arkaria.postgresql.org; Sun, 26 May 2024 17:48:51 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sBHzF-004PEY-14 for pgsql-general@arkaria.postgresql.org; Sun, 26 May 2024 17:48:49 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sBHzE-004PEQ-FI for pgsql-general@lists.postgresql.org; Sun, 26 May 2024 17:48:48 +0000 Received: from mail-lf1-x12f.google.com ([2a00:1450:4864:20::12f]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sBHz6-0026Vb-O6 for pgsql-general@lists.postgresql.org; Sun, 26 May 2024 17:48:47 +0000 Received: by mail-lf1-x12f.google.com with SMTP id 2adb3069b0e04-5295ae273c8so2634746e87.0 for ; Sun, 26 May 2024 10:48:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1716745718; x=1717350518; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=iRnmRhbvyjcsoaLBOZyN9P4rq5mVopaTwz7c2TGidlM=; b=Hlu36XZMlZbmuQXRoDH9Jg6EAGZSKzr/lm4q+PinPzbgLNzR9YoqWM0ybF6DxnvLzJ Q9GIP2iuQ3xvnYZWUvKHYvY5v7ze44c+pRrDnOtKIfOXHrVcGZ82a/9jkkJGDotGz3Ft p9fLqhPf0dO7DbDMMLutUlbFl9TUv6OVSmkpOcVdqHaJ8RoCbkoXNU5a0CVrm0jae9b8 6kG9JY1eQS2t0H8TRhi4Ve5Ese9tCibSnQrd+ohA+w6bs9rgwOKHFSrsOCvme9JnV1oO Uy6MiCKUk1PJTWPNQ2rM4Og4Qpp9rEK3vkrHAJ1bYk037VWWDaJZj6rUcgHuVkGilEcl TBCg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1716745718; x=1717350518; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=iRnmRhbvyjcsoaLBOZyN9P4rq5mVopaTwz7c2TGidlM=; b=Y19nu8NgV5CFx+7ofXDdj46P0w2z0sA9/zqYbY7udRtxjpZbe7qliu1IbhiWSxImxR gXQ0AYyyNiMhl9e3rnq/lX72oLLq5Z+9XVsBcg1Hmlm/AcRK2cDDUbzXiXnL6xbXQSYM yfdGuntBRPhX9YFH1JOFa9dXLZm6VidBOq1mk4TjvEyYcXp1JkWS1b9KCmxcQ8vkiCDq vnkosNS94CAQEiA/EUr0uHUApOFq3aKgS9Pky1Leqg8nq8grKnL4BSmjZUvs3rnVWaU9 9T9b2RT6fuwcP7DAAalmhdVMFBqDqzeoRf61sNZ57uCImMSu+9FmSe2oEvUSCILQ6Dew irHQ== X-Gm-Message-State: AOJu0Yxv3UuWkiKrn2WDU4YTSlKnSQXq4EPsIo02veWlFjnxcOb778XS ALocvqjW4kfH6sI4tHPZV/8AORHoH2Z+hWw+t9HoTB6Wzzbtcgl24geajwB5At/muQI/qROL7Pp B2mkCHkcvd9+xxrZDXx5oqHVtdw== X-Google-Smtp-Source: AGHT+IE1/rBWGkENXZ9af0PE6c12NTAMFM+xv7CzVhUE/8cEL7SERa9o2UK3KpmSrceRFWRF1ZKXI0LvqW3wL7BYBLw= X-Received: by 2002:a05:6512:3128:b0:522:33cd:64bd with SMTP id 2adb3069b0e04-52964aba466mr4550774e87.27.1716745717385; Sun, 26 May 2024 10:48:37 -0700 (PDT) MIME-Version: 1.0 References: <891bcfec74f7358ef0212caf6565a35153dd2941.camel@cybertec.at> In-Reply-To: From: =?UTF-8?Q?Torsten_F=C3=B6rtsch?= Date: Sun, 26 May 2024 19:48:25 +0200 Message-ID: Subject: Re: Long running query causing XID limit breach To: sud Cc: pgsql-general Content-Type: multipart/alternative; boundary="000000000000a792ab06195f027d" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000a792ab06195f027d Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sun, May 26, 2024 at 11:16=E2=80=AFAM sud wrote: > On Sun, May 26, 2024 at 1:43=E2=80=AFPM Torsten F=C3=B6rtsch > wrote: > >> On Sat, May 25, 2024 at 11:00=E2=80=AFPM sud wrote: >> >>> >>> But i have one question here , does max_standby_streaming_delay =3D 14 = , >>> means the queries on the standby will get cancelled after 14 seconds? >>> >> >> No, your query gets cancelled when it stalls replication for >14 sec. If >> your master is idle and does not send any WAL and the replica has >> caught up, the query can take as long as it wants. >> > > Thank you so much. > For example , in below scenario, > if i have insert query going on on primary instance on table 25th may > partition of TABLE1, and at same time we are selecting data from 24th May > partition , then with "max_standby_streaming_delay =3D 14" setup , it jus= t > allows the select query to run for any duration without any restriction > even if the WAL gets applied on the standby regularly. Also INSERT query = in > primary won't make the standby SELECT queries to cancel as because the WA= L > record of INSERT queries on the primary instance is not conflicting to th= e > exact rows those were being read by the standby. Is my understanding > correct here? > > However, if i have Update/Delete query going on on primary instance on > table 25th may partition of TABLE1 and on the exact same set of rows whi= ch > were being read by the standby instance by the SELECT query, then the > application of such WAL record to standby can max wait for 14 seconds and > thus those select query are prone to be cancelled after 14 seconds. Is th= is > understanding correct? > > If the above is true then it doesn't look good, as because in an OLTP > system there will be a lot of DMLS happening on the writer instances and > there may be many queries running on the reader/standby instances which a= re > meant to run for hours. And if we say making those SELECT queries run for > hours means compromising an hour of "high availability"/RPO or a lag of a= n > hour between primary and standby , that doesn't look good. Please > correct me if I am missing something here. > Each query on the replica has a backend_xmin. You can see that in pg_stat_activity. From that backend's perspective, tuples marked as deleted by any transaction greater or equal to backend_xmin are still needed. This does not depend on the table. Now, vacuum writes to the WAL up to which point it has vacuumed on the master. In pg_waldump this looks like so: PRUNE snapshotConflictHorizon: 774, nredirected: 0, ndead: 5, nunused: 0, redirected: [], dead: [2, 4, 6, 8, 10], unused: [], blkref #0: rel 1663/5/16430 blk 0 That snapshotConflictHorizon is also a transaction id. If the backend_xmin of all backends running transactions in the same database (the 5 in 16 63/5/16430) -as the vacuum WAL record is greater than vacuum's snapshotConflictHorizon, then there is no conflict. If any of the backend_xmin's is less, then there is a conflict. This type of conflict is determined by just 2 numbers, the conflict horizon sent by the master in the WAL, and the minimum of all backend_xmins. For your case this means a long running transaction querying table t1 might have a backend_xmin of 223. On the master update and delete operations happen on table T2. Since all the transactions on the master are fast, when vacuum hits T2, the minimum of all backend_xmins on the master might already be 425. So, garbage left over by all transactions up to 424 can be cleaned up. Now that cleanup record reaches the replica. It compares 223>425 which is false. So, there is a conflict. Now the replica can wait until its own horizon reaches 425 or it can kill all backends with a lower backend_xmin. As I understand, hot_standby_feedback does not work for you. Not sure if you can run the query on the master? That would resolve the issues but might generate the same bloat on the master as hot_standby_feedback. Another option I can see is to run long running queries on a dedicated replica with max_standby_streaming_delay set to infinity or something large enough. If you go that way, you could also fetch the WAL from your WAL archive instead of replicating from the master. That way the replica has absolutely no chance to affect the master. Good Luck! Torsten --000000000000a792ab06195f027d Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Sun, May 26, 2024 at 11:16=E2=80=AFAM = sud <suds1434@gmail.com> wr= ote:
On Sun, May 26, 2024 at 1= :43=E2=80=AFPM Torsten F=C3=B6rtsch <tfoertsch123@gmail.com> wrote:
<= div dir=3D"ltr">
On Sat, May 25, 2024 at 11:00=E2=80=AFPM s= ud <suds1434@gma= il.com> wrote:
=

But i have one question = here , does=C2=A0max_standby_streaming_delay =3D 14 , means the queries on = the standby will get cancelled after 14 seconds?

No, your query gets cancelled when it stalls replic= ation for >14 sec. If your master is idle and does not send any WAL and = the replica has caught=C2=A0up, the query can take as long as it wants.=C2= =A0

Thank you so mu= ch.=C2=A0
For example , in below scenario,
if i have in= sert query going on on primary instance on table=C2=A0 25th may partition o= f TABLE1, and at same time we are selecting data from 24th May partition , = then with "max_standby_streaming_delay =3D 14" setup , it just al= lows the select query to run for any duration without any restriction even = if the WAL=C2=A0gets applied on the standby regularly. Also INSERT query in= primary won't=C2=A0make the standby SELECT queries to cancel as becaus= e the WAL record of INSERT queries on the primary instance is not conflicti= ng to the exact=C2=A0rows those were being read by the standby. Is my under= standing correct here?

However, if i have Upd= ate/Delete query going on on primary instance on table=C2=A0 25th may parti= tion of TABLE1 and on the exact same set of rows which were being read=C2= =A0by the standby instance by the SELECT query, then the application of suc= h=C2=A0WAL=C2=A0record to standby can max wait for 14 seconds and thus thos= e select query are prone to be cancelled=C2=A0after 14 seconds. Is this und= erstanding correct?

If the above is true then it d= oesn't=C2=A0look good, as because in an OLTP system there will be a lot= of DMLS happening on the writer instances and there may be many queries ru= nning on the reader/standby instances which are meant to run for hours. And= if we say making those SELECT queries run for hours means compromising an = hour of "high availability"/RPO or a lag of an hour between prima= ry and standby , that doesn't=C2=A0look good. Please correct=C2=A0me if= I am missing something here.

=
Each query on the replica has a backend_xmin. You can see that i= n pg_stat_activity. From that backend's perspective, tuples marked as d= eleted by any transaction greater or equal to backend_xmin are still needed= . This does not depend on the table.

Now, vacuum w= rites to the WAL up to which point it has vacuumed on the master. In pg_wal= dump this looks like so:

PRUNE snapshotConflictHorizon= : 774, nredirected: 0, ndead: 5, nunused: 0, redirected: [], dead: [2, 4, 6= , 8, 10], unused: [], blkref #0: rel 1663/5/16430 blk 0

That snapshotConflictHorizon is also a transaction = id. If the backend_xmin of all backends running transactions in the same da= tabase (the 5 in=C2=A01663/5/16430) -as the= vacuum WAL record is greater than vacuum's snapshotConflictHorizon, th= en there is no conflict. If any of the backend_xmin's is less, then the= re is a conflict.

This type of conflict is determi= ned by just 2 numbers, the conflict horizon sent by the master in the WAL, = and the minimum of all backend_xmins. For your case this means a long runni= ng transaction querying table t1 might have a backend_xmin of 223. On the m= aster update and delete operations happen on table T2. Since all the transa= ctions on the master are fast, when vacuum hits T2, the minimum of all back= end_xmins on the master might already be 425. So, garbage left over by all = transactions up to 424 can be cleaned up. Now that cleanup record reaches t= he replica. It compares 223>425 which is false. So, there is a conflict.= Now the replica can wait until its own horizon reaches 425 or it can kill = all backends with a lower backend_xmin.

As I under= stand, hot_standby_feedback does not work for you. Not sure if you can run = the query on the master? That would resolve the issues but might generate t= he same bloat on the master as hot_standby_feedback. Another option I can s= ee is to run long running queries on a dedicated replica with max_standby_s= treaming_delay set to infinity or something large enough. If you go that wa= y, you could also fetch the WAL from your WAL=C2=A0archive instead of repli= cating from the master. That way the replica has absolutely no chance to af= fect the master.

Good Luck!
Torsten
--000000000000a792ab06195f027d--