Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sA2Ik-0077Dl-M2 for pgsql-general@arkaria.postgresql.org; Thu, 23 May 2024 06:51:48 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sA2Ik-008znM-14 for pgsql-general@arkaria.postgresql.org; Thu, 23 May 2024 06:51:46 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sA2Ij-008zlM-EY for pgsql-general@lists.postgresql.org; Thu, 23 May 2024 06:51:45 +0000 Received: from mail-lj1-x235.google.com ([2a00:1450:4864:20::235]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sA2Ig-001Zwc-1J for pgsql-general@lists.postgresql.org; Thu, 23 May 2024 06:51:44 +0000 Received: by mail-lj1-x235.google.com with SMTP id 38308e7fff4ca-2e538a264e0so101858811fa.1 for ; Wed, 22 May 2024 23:51:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bitnine-net.20230601.gappssmtp.com; s=20230601; t=1716447100; x=1717051900; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=o4q6ro1QAXiOUhbtOATI+747eS1KRPVsmhuOnnTVCRM=; b=p+unGLcry9mW1huCzEWKqRPzwIgKGCWgXMyr62qMywri9MnK54M7meowZx9zJVIWM7 gluS/Mr2ZrnSzHR7o3V52P879IeDN66OR6xygoR7l01TLORmGR8mELJ6hjvlXsjNUuZN Ym0KXXMwxb9IaSl2xJPbDxoZy9cDcLm4eGZDWmIDAM90HJ5R9vmb/BLfEkoGnNNbhTdi N1OCYXC92zEndjdig81BYDB1vA/PAeJYM3bkYeIX2XIJ0SHZsAWE03fxYjW8YHS7ZY+e lpOVRs/5UnX7uhxVyEVccn31oz/ENdGjqT/5GZdabHXXzNAlq/2mkDHBEzXeepS7HQrD leYg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1716447100; x=1717051900; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=o4q6ro1QAXiOUhbtOATI+747eS1KRPVsmhuOnnTVCRM=; b=vgSNo5qhqrulTIi5FpHlODBhOK9dSsNm+ZWHJV+q05uIwpMy6iFXZdm3tKlF4r5/zI 0eJF/Mrh3f8eTMmMc60GF2662cEF3sFQT4oYOE6uaCwWbzDtUEPy3fP2xVEFuJqyXCRa g65tor29q/vqYHlwxvF1qcBHL/VUSxVXaV5IufYLrgt/3fH3m0UshISwq1BNQ9N3v897 V+u4wE1M6VhfnU0je/PloMB8Ca2UGy+0h41VDCH5OWdYzbdfPgtuh+WF/EZ5pEf47hrn mWBTZAkp1vkRgM+29WjSrpeWzS9ZiJ87fUuqNcJtPd3O7meD1hs6ycxTmRH6Rl6GYhMa HIcQ== X-Gm-Message-State: AOJu0Ywo6zCbk65viLh31GupF1z5W0J86uYaT0W6OFNjvAih53dBo2Jp 7jCJpOxlxYv0JFuHsmPBtZ+lqhoxl9j9TWYDzFiD6YJSmjDBr18O0oXpM87GYGDrMHf6sd65ok4 Wfzqp/63BKxb4+SiUo/El21PvoHyZcxl3/e3Pgg== X-Google-Smtp-Source: AGHT+IEGVhKejDik93LCOac49+i2ieBIYTuTMXYjsNmw1sfkR9jW8blV4IWZS+Qt2lEfR1NmASHk2I7esRyPkoIHpxc= X-Received: by 2002:a2e:8606:0:b0:2e1:a106:3f5b with SMTP id 38308e7fff4ca-2e9494f6725mr33893291fa.27.1716447099664; Wed, 22 May 2024 23:51:39 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Muhammad Salahuddin Manzoor Date: Thu, 23 May 2024 11:51:28 +0500 Message-ID: Subject: Re: Long running query causing XID limit breach To: sud Cc: pgsql-general Content-Type: multipart/alternative; boundary="000000000000a6d61b0619197b86" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000a6d61b0619197b86 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Greetings, Yes, monitoring and alerting for VACUUM operations are crucial. Track VACUUM Duration and Success: SELECT pid, state, query_start, now() - query_start AS duration, query FROM pg_stat_activity WHERE query LIKE 'VACUUM%' ORDER BY duration DESC; Check Autovacuum Activity: SELECT relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_coun= t FROM pg_stat_all_tables WHERE relname =3D 'your_table_name'; Log and Monitor VACUUM Failures: log_autovacuum_min_duration =3D 0 # Log all autovacuum actions log_min_messages =3D 'WARNING' # Ensure warnings and above are logged Use tools like pgBadger to analyze PostgreSQL logs and identify any issues with autovacuum operations. Set Up Alerts for Long-Running VACUUMs: Use monitoring tools such as pgMonitor, Nagios, Prometheus with Grafana, or New Relic to set up alerts for long-running VACUUM processes. Yes, your understanding is correct. In a high-transaction environment like yours, long-running transactions, including legitimate reporting queries, can hold back the advancement of the transaction ID (XID) horizon. This can prevent VACUUM from properly cleaning up old XIDs, leading to the risk of XID wraparound and potential system failure. Use some Mitigation Strategies to handle long running quires like Set Transaction Timeouts, Monitor and Kill Long-Running Queries, Optimize Query Performance, Schedule heavy reporting queries during periods of lower transactional activity. *Salahuddin (=EC=82=B4=EB=9D=BC=ED=9B=84=EB=94=98* *)* On Thu, 23 May 2024 at 11:25, sud wrote: > Also,if i am getting it correct, it means we should not run any > transaction (even if it's legitimate one like for e.g. a big Reporting > "SELECT" query) beyond 10hrs, as that will end up consuming 10*200million > XID per hour=3D 2billion XID limit saturation and thus causing system > failure. Hope my understanding is correct here. > > On Thu, May 23, 2024 at 11:41=E2=80=AFAM sud wrote: > >> >> On Thu, May 23, 2024 at 10:42=E2=80=AFAM Muhammad Salahuddin Manzoor < >> salahuddin.m@bitnine.net> wrote: >> >>> Greetings, >>> >>> Running `VACUUM table_name;` on a partitioned table will vacuum each >>> partition individually, not the whole table as a single unit. >>> >>> Yes, running `VACUUM table_name;` frequently on tables or partitions >>> with heavy DML is recommended. >>> >>> Regular `VACUUM` does not lock the table for reads or writes, so it >>> won't disrupt ongoing 24/7 data operations. >>> >>> "optimize autovacuum" >>> Yes. Adjust following parameters as per your system/environment >>> requirement >>> autovacuum_max_workers, >>> autovacuum_freeze_max_age , >>> autovacuum_vacuum_cost_delay >>> >>> Following need to be first tested thoroughly in a test environment. >>> Recommended Alert Threshold >>> Alert at 50% Usage: Set the alert threshold at 1 billion used XIDs. Thi= s >>> provides a significant buffer, giving you ample time to take corrective >>> action before reaching the critical limit. >>> >>> Calculation Rationale >>> Daily XID Usage: Approximately 4 billion rows per day implies high XID >>> consumption. >>> Buffer Time: At 1 billion XIDs, you would still have 1 billion XIDs >>> remaining, giving you roughly 12 hours to address the issue if your sys= tem >>> consumes 200 million XIDs per hour. >>> >>> >>> >> Thank you so much. That helps. >> So apart from setting these alerts on "Maximumusedtxnids" and making the >> vacuum optimized by tweaking above parameters, should we also need to ha= ve >> monitoring in place to ensure the Vacuum is not taking longer as compare= d >> to its normal runtime and also if it's getting blocked/failed by somethi= ng? >> Like for example in our case where the select query was running longer ,= so >> the vacuum must not be able to succeed every time it attempts, so is it >> really worth having that level of alerting? and also how can we get an >> idea regarding if the vacuum is not succeeding or getting failed etc to >> avoid such upcoming issues? >> >> --000000000000a6d61b0619197b86 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Greetings,

Yes, monitoring and al= erting for VACUUM operations are crucial.

Track VACUUM Duration and = Success:

SELECT pid, state, query_start, now() - query_start AS dura= tion, query
FROM pg_stat_activity
WHERE query LIKE 'VACUUM%'<= br>ORDER BY duration DESC;

Check Autovacuum Activity:

SELECT = relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count
FR= OM pg_stat_all_tables
WHERE relname =3D 'your_table_name';
Log and Monitor VACUUM Failures:

log_autovacuum_min_duration =3D 0= =C2=A0# Log all autovacuum actions
log_min_messages =3D 'WARNING= 9; =C2=A0 =C2=A0 # Ensure warnings and above are logged

Use tools li= ke pgBadger to analyze PostgreSQL logs and identify any issues with autovac= uum operations.

Set Up Alerts for Long-Running VACUUMs:

Use m= onitoring tools such as pgMonitor, Nagios, Prometheus with Grafana, or New = Relic to set up alerts for long-running VACUUM processes.

Yes, your = understanding is correct. In a high-transaction environment like yours, lon= g-running transactions, including legitimate reporting queries, can hold ba= ck the advancement of the transaction ID (XID) horizon. This can prevent VA= CUUM from properly cleaning up old XIDs, leading to the risk of XID wraparo= und and potential system failure.

Use some Mitigation Strategies to = handle long running quires like Set Transaction Timeouts, Monitor and Kill = Long-Running Queries, Optimize Query Performance, Schedule heavy reporting = queries during periods of lower transactional activity.

Salahuddin (=EC= =82=B4=EB=9D=BC=ED=9B=84=EB=94=98)



On Thu, 23 May 2024 at 11:25, sud <suds1434@gmail.com> wrote:
Also,= if i am getting it correct, it means we should not run any transaction (eve= n if it's legitimate one like for e.g. a big Reporting "SELECT&quo= t; query) beyond 10hrs, as that will end up consuming 10*200million XID per= hour=3D 2billion XID limit saturation and thus causing system failure. Hop= e my understanding=C2=A0is correct here.

On Thu, May 23, 2024 at 11:41=E2=80= =AFAM sud <suds1= 434@gmail.com> wrote:

On Thu, May 23, 2024 at 10:4= 2=E2=80=AFAM Muhammad Salahuddin Manzoor <salahuddin.m@bitnine.net> wrote:
=
Gr= eetings,

Running `VACUUM table_name;` on a partitioned t= able will vacuum each partition individually, not the whole table as a sing= le unit.

Yes, running `VACUUM table_name;` frequently on tables or p= artitions with heavy DML is recommended.

Regular `VACUUM` does not l= ock the table for reads or writes, so it won't disrupt ongoing 24/7 dat= a operations.

"optimize autovacuum"
Yes. Adjust followi= ng parameters as per your system/environment requirement
autovacuum_max= _workers,=C2=A0
autovacuum_freeze_max_age ,=C2=A0
autov= acuum_vacuum_cost_delay

Following need to be first tested thoroughly= in a test environment.
Recommended Alert Threshold
Alert at 50% Usag= e: Set the alert threshold at 1 billion used XIDs. This provides a signific= ant buffer, giving you ample time to take corrective action before reaching= the critical limit.

Calculation Rationale
Daily XID Usage: Appro= ximately 4 billion rows per day implies high XID consumption.
Buffer Tim= e: At 1 billion XIDs, you would still have 1 billion XIDs remaining, giving= you roughly 12 hours to address the issue if your system consumes 200 mill= ion XIDs per hour.

<= div dir=3D"ltr">

Thank you so much. That helps.
So apart from setting these alerts on= "Maximumusedtxnids" and making the vacuum optimized by tweaking = above parameters, should we also need to have monitoring in place to ensure= =C2=A0the Vacuum is not taking longer as compared to its normal runtime and= also if it's getting blocked/failed by something? Like for example in = our case where the select query was running longer , so the vacuum=C2=A0mus= t not be able to succeed every time it attempts, so is it really worth havi= ng=C2=A0that level of alerting?=C2=A0 and also how can we get an idea regar= ding if the vacuum is not succeeding or getting failed etc to avoid such up= coming issues?=C2=A0

--000000000000a6d61b0619197b86--