Long running query causing XID limit breach

public inbox for [email protected]  
help / color / mirror / Atom feed

Long running query causing XID limit breach
24+ messages / 5 participants
[nested] [flat]

* Long running query causing XID limit breach
@ 2024-05-22 21:16 sud <[email protected]>
  2024-05-23 03:29 ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  0 siblings, 2 replies; 24+ messages in thread

From: sud @ 2024-05-22 21:16 UTC (permalink / raw)
  To: pgsql-general <[email protected]>

Hello ,
It's RDS postgres version 15.4. We suddenly saw the
"MaximumUsedTransactionIDs" reach to ~1.5billion and got alerted by team
members who mentioned the database is going to be in shutdown/hung if this
value reaches to ~2billion and won't be able to serve any incoming
transactions. It was a panic situation.

I have heard of it before , because of the way postgres works and the XID
being a datatype of length 32 bit integer can only represent (2^32)/2=~2
billion transactions. However, as RDS performs the auto vacuum , we thought
that we need not worry about this issue. But it seems we were wrong. And we
found one adhoc "SELECT '' query was running on the reader instance since
the last couple of days and when that was killed, the max xid
(MaximumUsedTransactionIDs) dropped to 50million immediately.

So I have few questions,

1)This system is going to be a 24/7 up and running system which will
process ~500million business transactions/day in future i.e. ~4-5billion
rows/day inserted across multiple tables each day. And as I understand each
row will have XID allocated. So in that case , does it mean that, we will
need (5billion/24)=~200million XID/hour and thus , if any such legitimate
application "SELECT" query keeps running for ~10 hours (and thus keep the
historical XID alive) , then it can saturate the
"MaximumUsedTransactionIDs" and make the database standstill in
2billion/200million=~10hrs. Is this understanding correct? Seems we are
prone to hit this limit sooner going forward.

2)We have some legitimate cases where the reporting queries can run for
5-6hrs. So in such cases if the start of this SELECT query happen at 100th
XID on table TAB1, then whatever transactions happen after that time,
across all other tables(table2, table3 etc) in the database won't get
vacuum until that SELECT query on table1 get vacuumed(as database will try
to keep that same 100th XID image) and the XID will just keep incrementing
for new transaction, eventually reaching the max limit. Is my understanding
correct here?

3)Although RDS does the auto vacuum by default. but should we also consider
doing manual vacuum without impacting ongoing transactions? Something as
below options
vacuum freeze tab1;
vacuum freeze;
vacuum;
vacuum analyze tab1;
vacuum tab1;

4)Had worked in past in oracle database where the similar transaction
identifier is called as "system change number" , but never encountered that
being exhausted and also there it used to have UNDO record and if a SELECT
query needs anything beyond certain limit(set undo_retention parameter) the
select query used to fail with snapshot too old error but not impacting any
write transactions. But in postgres it seems nothing like that happens and
every "Select query" will try to run till its completion without any such
failure, until it gets skilled by someone. Is my understanding correct?

 And in that case, It seems we have to mandatorily set "statement_timeout"
to some value e.g. 4hrs(also i am not seeing a way to set it for any
specific user level, so it will be set for all queries including
application level) and also "idle_in_transaction_session_timeout" to
5minutes, even on all the prod and non prod databases, to restrict the long
running transactions/queries and avoid such issues in future. Correct me if
I'm wrong.

Regards
Sud

^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-23 03:29 ` Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 04:48   ` Re: Long running query causing XID limit breach sud <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: Muhammad Salahuddin Manzoor @ 2024-05-23 03:29 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: pgsql-general <[email protected]>

Greetings,

In high-transaction environments like yours, it may be necessary to
supplement this with manual vacuuming.

Few Recommendations

Monitor Long-Running Queries try to optimize.
Optimize Autovacuum.
Partitioning.
Adopt Vacuum Strategy after peak hours.

*Salahuddin (살라후딘**)*




On Thu, 23 May 2024 at 02:16, sud <[email protected]> wrote:

> Hello ,
> It's RDS postgres version 15.4. We suddenly saw the
> "MaximumUsedTransactionIDs" reach to ~1.5billion and got alerted by team
> members who mentioned the database is going to be in shutdown/hung if this
> value reaches to ~2billion and won't be able to serve any incoming
> transactions. It was a panic situation.
>
> I have heard of it before , because of the way postgres works and the XID
> being a datatype of length 32 bit integer can only represent (2^32)/2=~2
> billion transactions. However, as RDS performs the auto vacuum , we thought
> that we need not worry about this issue. But it seems we were wrong. And we
> found one adhoc "SELECT '' query was running on the reader instance since
> the last couple of days and when that was killed, the max xid
> (MaximumUsedTransactionIDs) dropped to 50million immediately.
>
> So I have few questions,
>
> 1)This system is going to be a 24/7 up and running system which will
> process ~500million business transactions/day in future i.e. ~4-5billion
> rows/day inserted across multiple tables each day. And as I understand each
> row will have XID allocated. So in that case , does it mean that, we will
> need (5billion/24)=~200million XID/hour and thus , if any such legitimate
> application "SELECT" query keeps running for ~10 hours (and thus keep the
> historical XID alive) , then it can saturate the
> "MaximumUsedTransactionIDs" and make the database standstill in
> 2billion/200million=~10hrs. Is this understanding correct? Seems we are
> prone to hit this limit sooner going forward.
>
> 2)We have some legitimate cases where the reporting queries can run for
> 5-6hrs. So in such cases if the start of this SELECT query happen at 100th
> XID on table TAB1, then whatever transactions happen after that time,
> across all other tables(table2, table3 etc) in the database won't get
> vacuum until that SELECT query on table1 get vacuumed(as database will try
> to keep that same 100th XID image) and the XID will just keep incrementing
> for new transaction, eventually reaching the max limit. Is my understanding
> correct here?
>
> 3)Although RDS does the auto vacuum by default. but should we also
> consider doing manual vacuum without impacting ongoing transactions?
> Something as below options
> vacuum freeze tab1;
> vacuum freeze;
> vacuum;
> vacuum analyze tab1;
> vacuum tab1;
>
> 4)Had worked in past in oracle database where the similar transaction
> identifier is called as "system change number" , but never encountered that
> being exhausted and also there it used to have UNDO record and if a SELECT
> query needs anything beyond certain limit(set undo_retention parameter) the
> select query used to fail with snapshot too old error but not impacting any
> write transactions. But in postgres it seems nothing like that happens and
> every "Select query" will try to run till its completion without any such
> failure, until it gets skilled by someone. Is my understanding correct?
>
>  And in that case, It seems we have to mandatorily set "statement_timeout"
> to some value e.g. 4hrs(also i am not seeing a way to set it for any
> specific user level, so it will be set for all queries including
> application level) and also "idle_in_transaction_session_timeout" to
> 5minutes, even on all the prod and non prod databases, to restrict the long
> running transactions/queries and avoid such issues in future. Correct me if
> I'm wrong.
>
> Regards
> Sud
>


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 03:29 ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
@ 2024-05-23 04:48   ` sud <[email protected]>
  2024-05-23 05:12     ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: sud @ 2024-05-23 04:48 UTC (permalink / raw)
  To: Muhammad Salahuddin Manzoor <[email protected]>; +Cc: pgsql-general <[email protected]>

On Thu, May 23, 2024 at 9:00 AM Muhammad Salahuddin Manzoor <
[email protected]> wrote:

> Greetings,
>
> In high-transaction environments like yours, it may be necessary to
> supplement this with manual vacuuming.
>
> Few Recommendations
>
> Monitor Long-Running Queries try to optimize.
> Optimize Autovacuum.
> Partitioning.
> Adopt Vacuum Strategy after peak hours.
>
> We have these big tables already partitioned. So does "vacuum table_name"
will endup scanning whole table or just the latest/live partition which is
getting loaded currently? and do you mean to say running command "vacuum
table_name;" frequently on selective tables that are experiencing heavy DML
? Hope this won't lock the table anyway because the data will be
written/read from these tables 24/7.

When you say, "optimize autovacuum" does it mean to set a higher value
of "autovacuum_max_workers"
and "autovacuum_freeze_max_age"?

Considering we have ~4 billion rows inserted daily into the table and there
is limit of ~2billion to the "Maximumusedtxnids", what threshold should we
set for the alerting and to have enough time at hand to fix this issue?


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 03:29 ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 04:48   ` Re: Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-23 05:12     ` Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 06:11       ` Re: Long running query causing XID limit breach sud <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Muhammad Salahuddin Manzoor @ 2024-05-23 05:12 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: pgsql-general <[email protected]>

Greetings,

Running `VACUUM table_name;` on a partitioned table will vacuum each
partition individually, not the whole table as a single unit.

Yes, running `VACUUM table_name;` frequently on tables or partitions with
heavy DML is recommended.

Regular `VACUUM` does not lock the table for reads or writes, so it won't
disrupt ongoing 24/7 data operations.

"optimize autovacuum"
Yes. Adjust following parameters as per your system/environment requirement
autovacuum_max_workers,
autovacuum_freeze_max_age ,
autovacuum_vacuum_cost_delay

Following need to be first tested thoroughly in a test environment.
Recommended Alert Threshold
Alert at 50% Usage: Set the alert threshold at 1 billion used XIDs. This
provides a significant buffer, giving you ample time to take corrective
action before reaching the critical limit.

Calculation Rationale
Daily XID Usage: Approximately 4 billion rows per day implies high XID
consumption.
Buffer Time: At 1 billion XIDs, you would still have 1 billion XIDs
remaining, giving you roughly 12 hours to address the issue if your system
consumes 200 million XIDs per hour.

*Salahuddin (살라후딘**)*

On Thu, 23 May 2024 at 09:48, sud <[email protected]> wrote:

> On Thu, May 23, 2024 at 9:00 AM Muhammad Salahuddin Manzoor <
> [email protected]> wrote:
>
>> Greetings,
>>
>> In high-transaction environments like yours, it may be necessary to
>> supplement this with manual vacuuming.
>>
>> Few Recommendations
>>
>> Monitor Long-Running Queries try to optimize.
>> Optimize Autovacuum.
>> Partitioning.
>> Adopt Vacuum Strategy after peak hours.
>>
>> We have these big tables already partitioned. So does "vacuum table_name"
> will endup scanning whole table or just the latest/live partition which is
> getting loaded currently? and do you mean to say running command "vacuum
> table_name;" frequently on selective tables that are experiencing heavy DML
> ? Hope this won't lock the table anyway because the data will be
> written/read from these tables 24/7.
>
> When you say, "optimize autovacuum" does it mean to set a higher value of "autovacuum_max_workers"
> and "autovacuum_freeze_max_age"?
>
> Considering we have ~4 billion rows inserted daily into the table and
> there is limit of ~2billion to the "Maximumusedtxnids", what threshold
> should we set for the alerting and to have enough time at hand to fix this
> issue?
>
>

^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 03:29 ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 04:48   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 05:12     ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
@ 2024-05-23 06:11       ` sud <[email protected]>
  2024-05-23 06:25         ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:19         ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
  0 siblings, 2 replies; 24+ messages in thread

From: sud @ 2024-05-23 06:11 UTC (permalink / raw)
  To: Muhammad Salahuddin Manzoor <[email protected]>; +Cc: pgsql-general <[email protected]>

On Thu, May 23, 2024 at 10:42 AM Muhammad Salahuddin Manzoor <
[email protected]> wrote:

> Greetings,
>
> Running `VACUUM table_name;` on a partitioned table will vacuum each
> partition individually, not the whole table as a single unit.
>
> Yes, running `VACUUM table_name;` frequently on tables or partitions with
> heavy DML is recommended.
>
> Regular `VACUUM` does not lock the table for reads or writes, so it won't
> disrupt ongoing 24/7 data operations.
>
> "optimize autovacuum"
> Yes. Adjust following parameters as per your system/environment
> requirement
> autovacuum_max_workers,
> autovacuum_freeze_max_age ,
> autovacuum_vacuum_cost_delay
>
> Following need to be first tested thoroughly in a test environment.
> Recommended Alert Threshold
> Alert at 50% Usage: Set the alert threshold at 1 billion used XIDs. This
> provides a significant buffer, giving you ample time to take corrective
> action before reaching the critical limit.
>
> Calculation Rationale
> Daily XID Usage: Approximately 4 billion rows per day implies high XID
> consumption.
> Buffer Time: At 1 billion XIDs, you would still have 1 billion XIDs
> remaining, giving you roughly 12 hours to address the issue if your system
> consumes 200 million XIDs per hour.
>
>
>
Thank you so much. That helps.
So apart from setting these alerts on "Maximumusedtxnids" and making the
vacuum optimized by tweaking above parameters, should we also need to have
monitoring in place to ensure the Vacuum is not taking longer as compared
to its normal runtime and also if it's getting blocked/failed by something?
Like for example in our case where the select query was running longer , so
the vacuum must not be able to succeed every time it attempts, so is it
really worth having that level of alerting?  and also how can we get an
idea regarding if the vacuum is not succeeding or getting failed etc to
avoid such upcoming issues?


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 03:29 ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 04:48   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 05:12     ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 06:11       ` Re: Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-23 06:25         ` sud <[email protected]>
  2024-05-23 06:51           ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: sud @ 2024-05-23 06:25 UTC (permalink / raw)
  To: Muhammad Salahuddin Manzoor <[email protected]>; +Cc: pgsql-general <[email protected]>

Also,if i am getting it correct, it means we should not run any transaction
(even if it's legitimate one like for e.g. a big Reporting "SELECT" query)
beyond 10hrs, as that will end up consuming 10*200million XID per hour=
2billion XID limit saturation and thus causing system failure. Hope my
understanding is correct here.

On Thu, May 23, 2024 at 11:41 AM sud <[email protected]> wrote:

>
> On Thu, May 23, 2024 at 10:42 AM Muhammad Salahuddin Manzoor <
> [email protected]> wrote:
>
>> Greetings,
>>
>> Running `VACUUM table_name;` on a partitioned table will vacuum each
>> partition individually, not the whole table as a single unit.
>>
>> Yes, running `VACUUM table_name;` frequently on tables or partitions with
>> heavy DML is recommended.
>>
>> Regular `VACUUM` does not lock the table for reads or writes, so it won't
>> disrupt ongoing 24/7 data operations.
>>
>> "optimize autovacuum"
>> Yes. Adjust following parameters as per your system/environment
>> requirement
>> autovacuum_max_workers,
>> autovacuum_freeze_max_age ,
>> autovacuum_vacuum_cost_delay
>>
>> Following need to be first tested thoroughly in a test environment.
>> Recommended Alert Threshold
>> Alert at 50% Usage: Set the alert threshold at 1 billion used XIDs. This
>> provides a significant buffer, giving you ample time to take corrective
>> action before reaching the critical limit.
>>
>> Calculation Rationale
>> Daily XID Usage: Approximately 4 billion rows per day implies high XID
>> consumption.
>> Buffer Time: At 1 billion XIDs, you would still have 1 billion XIDs
>> remaining, giving you roughly 12 hours to address the issue if your system
>> consumes 200 million XIDs per hour.
>>
>>
>>
> Thank you so much. That helps.
> So apart from setting these alerts on "Maximumusedtxnids" and making the
> vacuum optimized by tweaking above parameters, should we also need to have
> monitoring in place to ensure the Vacuum is not taking longer as compared
> to its normal runtime and also if it's getting blocked/failed by something?
> Like for example in our case where the select query was running longer , so
> the vacuum must not be able to succeed every time it attempts, so is it
> really worth having that level of alerting?  and also how can we get an
> idea regarding if the vacuum is not succeeding or getting failed etc to
> avoid such upcoming issues?
>
>


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 03:29 ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 04:48   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 05:12     ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 06:11       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 06:25         ` Re: Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-23 06:51           ` Muhammad Salahuddin Manzoor <[email protected]>
  0 siblings, 0 replies; 24+ messages in thread

From: Muhammad Salahuddin Manzoor @ 2024-05-23 06:51 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: pgsql-general <[email protected]>

Greetings,

Yes, monitoring and alerting for VACUUM operations are crucial.

Track VACUUM Duration and Success:

SELECT pid, state, query_start, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE query LIKE 'VACUUM%'
ORDER BY duration DESC;

Check Autovacuum Activity:

SELECT relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count
FROM pg_stat_all_tables
WHERE relname = 'your_table_name';

Log and Monitor VACUUM Failures:

log_autovacuum_min_duration = 0  # Log all autovacuum actions
log_min_messages = 'WARNING'     # Ensure warnings and above are logged

Use tools like pgBadger to analyze PostgreSQL logs and identify any issues
with autovacuum operations.

Set Up Alerts for Long-Running VACUUMs:

Use monitoring tools such as pgMonitor, Nagios, Prometheus with Grafana, or
New Relic to set up alerts for long-running VACUUM processes.

Yes, your understanding is correct. In a high-transaction environment like
yours, long-running transactions, including legitimate reporting queries,
can hold back the advancement of the transaction ID (XID) horizon. This can
prevent VACUUM from properly cleaning up old XIDs, leading to the risk of
XID wraparound and potential system failure.

Use some Mitigation Strategies to handle long running quires like Set
Transaction Timeouts, Monitor and Kill Long-Running Queries, Optimize Query
Performance, Schedule heavy reporting queries during periods of lower
transactional activity.

*Salahuddin (살라후딘*
*)*


On Thu, 23 May 2024 at 11:25, sud <[email protected]> wrote:

> Also,if i am getting it correct, it means we should not run any
> transaction (even if it's legitimate one like for e.g. a big Reporting
> "SELECT" query) beyond 10hrs, as that will end up consuming 10*200million
> XID per hour= 2billion XID limit saturation and thus causing system
> failure. Hope my understanding is correct here.
>
> On Thu, May 23, 2024 at 11:41 AM sud <[email protected]> wrote:
>
>>
>> On Thu, May 23, 2024 at 10:42 AM Muhammad Salahuddin Manzoor <
>> [email protected]> wrote:
>>
>>> Greetings,
>>>
>>> Running `VACUUM table_name;` on a partitioned table will vacuum each
>>> partition individually, not the whole table as a single unit.
>>>
>>> Yes, running `VACUUM table_name;` frequently on tables or partitions
>>> with heavy DML is recommended.
>>>
>>> Regular `VACUUM` does not lock the table for reads or writes, so it
>>> won't disrupt ongoing 24/7 data operations.
>>>
>>> "optimize autovacuum"
>>> Yes. Adjust following parameters as per your system/environment
>>> requirement
>>> autovacuum_max_workers,
>>> autovacuum_freeze_max_age ,
>>> autovacuum_vacuum_cost_delay
>>>
>>> Following need to be first tested thoroughly in a test environment.
>>> Recommended Alert Threshold
>>> Alert at 50% Usage: Set the alert threshold at 1 billion used XIDs. This
>>> provides a significant buffer, giving you ample time to take corrective
>>> action before reaching the critical limit.
>>>
>>> Calculation Rationale
>>> Daily XID Usage: Approximately 4 billion rows per day implies high XID
>>> consumption.
>>> Buffer Time: At 1 billion XIDs, you would still have 1 billion XIDs
>>> remaining, giving you roughly 12 hours to address the issue if your system
>>> consumes 200 million XIDs per hour.
>>>
>>>
>>>
>> Thank you so much. That helps.
>> So apart from setting these alerts on "Maximumusedtxnids" and making the
>> vacuum optimized by tweaking above parameters, should we also need to have
>> monitoring in place to ensure the Vacuum is not taking longer as compared
>> to its normal runtime and also if it's getting blocked/failed by something?
>> Like for example in our case where the select query was running longer , so
>> the vacuum must not be able to succeed every time it attempts, so is it
>> really worth having that level of alerting?  and also how can we get an
>> idea regarding if the vacuum is not succeeding or getting failed etc to
>> avoid such upcoming issues?
>>
>>


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 03:29 ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 04:48   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 05:12     ` Re: Long running query causing XID limit breach Muhammad Salahuddin Manzoor <[email protected]>
  2024-05-23 06:11       ` Re: Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-23 07:19         ` yudhi s <[email protected]>
  1 sibling, 0 replies; 24+ messages in thread

From: yudhi s @ 2024-05-23 07:19 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: Muhammad Salahuddin Manzoor <[email protected]>; pgsql-general <[email protected]>

On Thu, May 23, 2024 at 11:42 AM sud <[email protected]> wrote:

>
>> Calculation Rationale
>> Daily XID Usage: Approximately 4 billion rows per day implies high XID
>> consumption.
>> Buffer Time: At 1 billion XIDs, you would still have 1 billion XIDs
>> remaining, giving you roughly 12 hours to address the issue if your system
>> consumes 200 million XIDs per hour.
>>
>>
>>
>
> OP mentioned that initially the number of business transactions is around
500million but the rows inserted across many tables are ~4-5billion in
total per day. So doesn't it mean that the XID consumption will happen
based on the transactions rather on the number of rows basis. Say
for example ~4billion rows may be loaded using a batch size of ~1000 using
bulk load, so that will be ~4million txn so it should use ~4million XID but
not 4 billion XID usage. And thus making the transactions process in
batches rather than row by row minimizes the XID usage. Correct?


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-23 07:52 ` Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: Laurenz Albe @ 2024-05-23 07:52 UTC (permalink / raw)
  To: sud <[email protected]>; pgsql-general <[email protected]>

On Thu, 2024-05-23 at 02:46 +0530, sud wrote:
> It's RDS postgres version 15.4. We suddenly saw the "MaximumUsedTransactionIDs"
> reach to ~1.5billion and got alerted by team members who mentioned the database
> is going to be in shutdown/hung if this value reaches to ~2billion and won't be
> able to serve any incoming transactions. It was a panic situation.
> 
> I have heard of it before , because of the way postgres works and the XID being
> a datatype of length 32 bit integer can only represent (2^32)/2=~2 billion
> transactions. However, as RDS performs the auto vacuum , we thought that we need
> not worry about this issue. But it seems we were wrong. And we found one adhoc
> "SELECT '' query was running on the reader instance since the last couple of
> days and when that was killed, the max xid (MaximumUsedTransactionIDs) dropped
> to 50million immediately.

This has nothing to do with autovacuum running.
PostgreSQL won't freeze any rows above the xmin horizon (see the "backend_xmin"
column in "pg_stat_activity").

> So I have few questions,
> 
> 1)This system is going to be a 24/7 up and running system which will process
>   ~500million business transactions/day in future i.e. ~4-5billion rows/day
>   inserted across multiple tables each day. And as I understand each row will
>   have XID allocated. So in that case , does it mean that, we will need
>   (5billion/24)=~200million XID/hour and thus , if any such legitimate
>   application "SELECT" query keeps running for ~10 hours (and thus keep the
>   historical XID alive) , then it can saturate the "MaximumUsedTransactionIDs"
>   and make the database standstill in 2billion/200million=~10hrs. Is this
>   understanding correct? Seems we are prone to hit this limit sooner going forward.

Yes, that is correct.  You cannot run such long-running queries with a
transaction rate like that.

> 2)We have some legitimate cases where the reporting queries can run for 5-6hrs.
>   So in such cases if the start of this SELECT query happen at 100th XID on
>   table TAB1, then whatever transactions happen after that time, across all
>   other tables(table2, table3 etc) in the database won't get vacuum until that
>   SELECT query on table1 get vacuumed(as database will try to keep that same
>   100th XID image) and the XID will just keep incrementing for new transaction,
>   eventually reaching the max limit. Is my understanding correct here?

Again correct.  PostgreSQL cannot tell which tables the query will use in the
future, so VACUUM cannot clean up old row versions in any table in the database.
This is irrelevant for transaction ID wraparound, though: you'll get into trouble
even if only a single table holds an unfrozen row that is old enough.

> 3)Although RDS does the auto vacuum by default. but should we also consider
>   doing manual vacuum without impacting ongoing transactions?

That won't help.  Autovacuum is running, but cannot freeze the rows, and a manual
VACUUM cannot do it either.

> 4)Had worked in past in oracle database [...]

Oracle implements all this radically differently.

>  And in that case, It seems we have to mandatorily set "statement_timeout" to
> some value e.g. 4hrs(also i am not seeing a way to set it for any specific
> user level, so it will be set for all queries including application level)
> and also "idle_in_transaction_session_timeout" to 5minutes, even on all the
> prod and non prod databases, to restrict the long running transactions/queries
> and avoid such issues in future. Correct me if I'm wrong.

That looks right.

One thing you could consider is running the long-running queries on a standby
server.  Replication will get delayed, and you have to keep all the WAL
around for the standby to catch up once the query is done, but it should work.
You'd set "max_streaming_standby_delay" to -1 on the standby.

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
@ 2024-05-23 08:11   ` sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: sud @ 2024-05-23 08:11 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: pgsql-general <[email protected]>

On Thu, May 23, 2024 at 1:22 PM Laurenz Albe <[email protected]>
wrote:

> On Thu, 2024-05-23 at 02:46 +0530, sud wrote:
> > It's RDS postgres version 15.4. We suddenly saw the
> "MaximumUsedTransactionIDs"
> > reach to ~1.5billion and got alerted by team members who mentioned the
> database
> > is going to be in shutdown/hung if this value reaches to ~2billion and
> won't be
> > able to serve any incoming transactions. It was a panic situation.
> >
> > I have heard of it before , because of the way postgres works and the
> XID being
> > a datatype of length 32 bit integer can only represent (2^32)/2=~2
> billion
> > transactions. However, as RDS performs the auto vacuum , we thought that
> we need
> > not worry about this issue. But it seems we were wrong. And we found one
> adhoc
> > "SELECT '' query was running on the reader instance since the last
> couple of
> > days and when that was killed, the max xid (MaximumUsedTransactionIDs)
> dropped
> > to 50million immediately.
>
> This has nothing to do with autovacuum running.
> PostgreSQL won't freeze any rows above the xmin horizon (see the
> "backend_xmin"
> column in "pg_stat_activity").
>
> > So I have few questions,
> >
> > 1)This system is going to be a 24/7 up and running system which will
> process
> >   ~500million business transactions/day in future i.e. ~4-5billion
> rows/day
> >   inserted across multiple tables each day. And as I understand each row
> will
> >   have XID allocated. So in that case , does it mean that, we will need
> >   (5billion/24)=~200million XID/hour and thus , if any such legitimate
> >   application "SELECT" query keeps running for ~10 hours (and thus keep
> the
> >   historical XID alive) , then it can saturate the
> "MaximumUsedTransactionIDs"
> >   and make the database standstill in 2billion/200million=~10hrs. Is this
> >   understanding correct? Seems we are prone to hit this limit sooner
> going forward.
>
> Yes, that is correct.  You cannot run such long-running queries with a
> transaction rate like that.
>
>
When you mean transaction ,does it mean one commit ? For example if it's
inserting+committing ~1000 rows in one batch then all the 1000 rows will be
marked as one XID rather than 1000 different XID. and so we should look for
batch processing rather than row by row types processing. Is the
understanding correct?


> One thing you could consider is running the long-running queries on a
> standby
> server.  Replication will get delayed, and you have to keep all the WAL
> around for the standby to catch up once the query is done, but it should
> work.
> You'd set "max_streaming_standby_delay" to -1 on the standby.
>
>
We have the "Select query" running on a reader instance , but still the
writer instance was showing up "MaximumUsedTransactionIDs" reaching
1.5billion, so it means both the instance as part of same cluster so
sharing same XIDs, and as per your suggestion we should run this in
separate standby cluster altogether which does not share same XID. Is this
understanding correct? or it can be handled even with another reader
instance by just tweaking some other parameter so that they won't share the
same XID?


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-23 08:15     ` Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Laurenz Albe @ 2024-05-23 08:15 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: pgsql-general <[email protected]>

On Thu, 2024-05-23 at 13:41 +0530, sud wrote:
> > Yes, that is correct.  You cannot run such long-running queries with a
> > transaction rate like that.
> 
> When you mean transaction ,does it mean one commit ? For example if it's
> inserting+committing ~1000 rows in one batch then all the 1000 rows will
> be marked as one XID rather than 1000 different XID. and so we should look
> for batch processing rather than row by row types processing.
> Is the understanding correct?

Yes, that would help.

> > One thing you could consider is running the long-running queries on a standby
> > server.  Replication will get delayed, and you have to keep all the WAL
> > around for the standby to catch up once the query is done, but it should work.
> > You'd set "max_streaming_standby_delay" to -1 on the standby.
> 
> We have the "Select query" running on a reader instance , but still the writer
> instance was showing up "MaximumUsedTransactionIDs" reaching 1.5billion, so it
> means both the instance as part of same cluster so sharing same XIDs

If a long running query on the standby influences the primary, that means that
you have "hot_standby_feedback" set to "on".  Set it to "off".

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
@ 2024-05-23 12:45       ` sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: sud @ 2024-05-23 12:45 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: pgsql-general <[email protected]>

On Thu, May 23, 2024 at 1:45 PM Laurenz Albe <[email protected]>
wrote:

>
>
> If a long running query on the standby influences the primary, that means
> that
> you have "hot_standby_feedback" set to "on".  Set it to "off".
>
>
> Will the setting up of "hot_standby_feedback" value to OFF will cause the
reader instance to give incorrect query results or unexpected query failure
which will be potential inconsistency between the writer and reader
instance, as because those XID's can be removed/cleaned by the writer node
even if its being read by the reader instance query. And it can have more
replication lag. So I'm wondering , if this setup is advisable one?


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-23 14:41         ` Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Laurenz Albe @ 2024-05-23 14:41 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: pgsql-general <[email protected]>

On Thu, 2024-05-23 at 18:15 +0530, sud wrote:
> On Thu, May 23, 2024 at 1:45 PM Laurenz Albe <[email protected]> wrote:
> > If a long running query on the standby influences the primary, that means that
> > you have "hot_standby_feedback" set to "on".  Set it to "off".
> 
> Will the setting up of "hot_standby_feedback" value to OFF will cause the
> reader instance to give incorrect query results or unexpected query failure
> which will be potential inconsistency between the writer and reader instance,
> as because those XID's can be removed/cleaned by the writer node even if its
> being read by the reader instance query. And it can have more replication lag.

There will never be incorrect query results.

It can happen that a query on the standby gets canceled if you don't set
"max_standby_streaming_delay" to -1, but that can happen even if
"hot_standby_feedback" is "on".  It just happens less often.

The effect of setting "max_standby_streaming_delay" to -1 will often be a
replication delay if you run a long query.  That's what you have to
accept if you want to execute long-running queries.

You will never be able to have both of the following:
- queries never get canceled
- there is no replication delay

> So I'm wondering , if this setup is advisable one?

I'd say yes.  Anyway, if doesn't look like you have an alternative if
you want to run queries that take longer than it takes your transaction
ID counter to wrap around.

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
@ 2024-05-24 05:04           ` sud <[email protected]>
  2024-05-25 20:54             ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: sud @ 2024-05-24 05:04 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: pgsql-general <[email protected]>

On Thu, May 23, 2024 at 8:11 PM Laurenz Albe <[email protected]>
wrote:

> On Thu, 2024-05-23 at 18:15 +0530, sud wrote:
> > On Thu, May 23, 2024 at 1:45 PM Laurenz Albe <[email protected]>
> wrote:
> > > If a long running query on the standby influences the primary, that
> means that
> > > you have "hot_standby_feedback" set to "on".  Set it to "off".
> >
> > Will the setting up of "hot_standby_feedback" value to OFF will cause the
> > reader instance to give incorrect query results or unexpected query
> failure
> > which will be potential inconsistency between the writer and reader
> instance,
> > as because those XID's can be removed/cleaned by the writer node even if
> its
> > being read by the reader instance query. And it can have more
> replication lag.
>
> There will never be incorrect query results.
>
> It can happen that a query on the standby gets canceled if you don't set
> "max_standby_streaming_delay" to -1, but that can happen even if
> "hot_standby_feedback" is "on".  It just happens less often.
>
> The effect of setting "max_standby_streaming_delay" to -1 will often be a
> replication delay if you run a long query.  That's what you have to
> accept if you want to execute long-running queries.
>
> You will never be able to have both of the following:
> - queries never get canceled
> - there is no replication delay
>
>
>
I am trying to understand these two parameters and each time it looks a bit
confusing to me. If These two parameters complement or conflict with each
other.

Say for example, If we set hot_feedback_standby to ON (which is currently
set as default ON by the way), it will make the primary wait till the query
completion at standby and can cause such a high bump in XID in scenarios
where the query on standby runs for days(like in our current scenario which
happens). So we were thinking of setting it as OFF, to avoid
the transaction ID wrap around issue..

But as you also mentioned to set the "max_standby_streaming_delay" to -1
(which is currently set as 14 second in our case) ,it will wait infinitely
, till the query completes on the standby and wont apply the WAL which can
cause override of the XID which the standby query is reading from. But wont
this same behaviour be happening while we have hot_feedback_standby set as
"ON"?

But again for HA , in case primary down we should not be in big lag for the
standby and thus we want the standby also with minimal lag. And as you
mentioned there will never be incorrect results but at amx it will be query
cancellation, so I was thinking , if it's fine to just keep the
"hot_feedback_standby" as OFF and let the max_standby_streaming_delay set
as it is like 14 sec. Let me know your thoughts.

Basically below are the combinations, i am confused between..

hot_feedback_stanby ON and max_standby_streaming_delay=-1
or
hot_feedback_stanby OFF and max_standby_streaming_delay=-1
Or
hot_feedback_stanby ON and max_standby_streaming_delay=14 sec
Or
hot_feedback_stanby OFF and max_standby_streaming_delay=14 sec

^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-25 20:54             ` yudhi s <[email protected]>
  2024-05-25 20:59               ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-27 09:20               ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  0 siblings, 2 replies; 24+ messages in thread

From: yudhi s @ 2024-05-25 20:54 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: Laurenz Albe <[email protected]>; pgsql-general <[email protected]>

On Fri, May 24, 2024 at 10:34 AM sud <[email protected]> wrote:

> I am trying to understand these two parameters and each time it looks a
> bit confusing to me. If These two parameters complement or conflict with
> each other.
>
> Say for example, If we set hot_feedback_standby to ON (which is currently
> set as default ON by the way), it will make the primary wait till the query
> completion at standby and can cause such a high bump in XID in scenarios
> where the query on standby runs for days(like in our current scenario which
> happens). So we were thinking of setting it as OFF, to avoid
> the transaction ID wrap around issue..
>
> But as you also mentioned to set the "max_standby_streaming_delay" to -1
> (which is currently set as 14 second in our case) ,it will wait infinitely
> , till the query completes on the standby and wont apply the WAL which can
> cause override of the XID which the standby query is reading from. But wont
> this same behaviour be happening while we have hot_feedback_standby set as
> "ON"?
>
> But again for HA , in case primary down we should not be in big lag for
> the standby and thus we want the standby also with minimal lag. And as you
> mentioned there will never be incorrect results but at amx it will be query
> cancellation, so I was thinking , if it's fine to just keep the
> "hot_feedback_standby" as OFF and let the max_standby_streaming_delay set
> as it is like 14 sec. Let me know your thoughts.
>
> Basically below are the combinations, i am confused between..
>
> hot_feedback_stanby ON and max_standby_streaming_delay=-1
> or
> hot_feedback_stanby OFF and max_standby_streaming_delay=-1
> Or
> hot_feedback_stanby ON and max_standby_streaming_delay=14 sec
> Or
> hot_feedback_stanby OFF and max_standby_streaming_delay=14 sec
>

As per my understanding here, this would be the behaviour. Others may
comment..

*hot_standby_feedback ON and max_standby_streaming_delay = -1:*
Ensures that long-running queries on the standby are not interrupted. The
primary waits indefinitely to avoid vacuuming rows needed by standby
queries.
But Can lead to significant replication lag and increased XID consumption
on the primary, potentially causing transaction ID wraparound issues.

*hot_standby_feedback OFF and max_standby_streaming_delay = -1:*
Ensures long-running queries on the standby are not interrupted. No
feedback is sent to the primary, reducing the risk of XID wraparound.
But The standby may fall significantly behind the primary, resulting in
high replication lag.

*hot_standby_feedback ON and max_standby_streaming_delay = 14 seconds:*
The primary prevents vacuuming rows needed by standby queries, reducing
query cancellations on the standby. The replication lag is limited to 14
seconds.
But Long-running queries on the standby that exceed 14 seconds may be
canceled, and the primary can still experience increased XID consumption.

*hot_standby_feedback OFF and max_standby_streaming_delay = 14 seconds:*
 Limits replication lag to 14 seconds and reduces XID consumption on the
primary. Queries on the standby exceeding 14 seconds are canceled.
but Long-running queries on the standby are more likely to be canceled due
to the lack of feedback to the primary.

^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-25 20:54             ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
@ 2024-05-25 20:59               ` sud <[email protected]>
  2024-05-26 08:13                 ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: sud @ 2024-05-25 20:59 UTC (permalink / raw)
  To: yudhi s <[email protected]>; +Cc: Laurenz Albe <[email protected]>; pgsql-general <[email protected]>

On Sun, May 26, 2024 at 2:24 AM yudhi s <[email protected]> wrote:

>
>
> *hot_standby_feedback ON and max_standby_streaming_delay = -1:*
> Ensures that long-running queries on the standby are not interrupted. The
> primary waits indefinitely to avoid vacuuming rows needed by standby
> queries.
> But Can lead to significant replication lag and increased XID consumption
> on the primary, potentially causing transaction ID wraparound issues.
>
>
> *hot_standby_feedback OFF and max_standby_streaming_delay = -1:*
> Ensures long-running queries on the standby are not interrupted. No
> feedback is sent to the primary, reducing the risk of XID wraparound.
> But The standby may fall significantly behind the primary, resulting in
> high replication lag.
>
>
>
> *hot_standby_feedback ON and max_standby_streaming_delay = 14 seconds:*
> The primary prevents vacuuming rows needed by standby queries, reducing
> query cancellations on the standby. The replication lag is limited to 14
> seconds.
> But Long-running queries on the standby that exceed 14 seconds may be
> canceled, and the primary can still experience increased XID consumption.
>
>
> *hot_standby_feedback OFF and max_standby_streaming_delay = 14 seconds:*
>  Limits replication lag to 14 seconds and reduces XID consumption on the
> primary. Queries on the standby exceeding 14 seconds are canceled.
> but Long-running queries on the standby are more likely to be canceled due
> to the lack of feedback to the primary.
>
>
>

Thank you so much.
Does it mean that the last one we should go for i.e. (*hot_standby_feedback
OFF and max_standby_streaming_delay = 14 seconds), *as because high
availability is also a key requirement in any production environment, so
keeping 14 seconds lag is kind of okay and also at the same time
keeping hot_standby_feedback OFF will make sure the transaction id
wraparound around won't happen because of any long running query on standby
as it won't wait for the stand by feedback for vacuuming the tables.

But i have one question here , does max_standby_streaming_delay = 14 ,
means the queries on the standby will get cancelled after 14 seconds?


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-25 20:54             ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
  2024-05-25 20:59               ` Re: Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-26 08:13                 ` Torsten Förtsch <[email protected]>
  2024-05-26 09:15                   ` Re: Long running query causing XID limit breach sud <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Torsten Förtsch @ 2024-05-26 08:13 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: pgsql-general <[email protected]>

On Sat, May 25, 2024 at 11:00 PM sud <[email protected]> wrote:

>
> But i have one question here , does max_standby_streaming_delay = 14 ,
> means the queries on the standby will get cancelled after 14 seconds?
>

No, your query gets cancelled when it stalls replication for >14 sec. If
your master is idle and does not send any WAL and the replica has
caught up, the query can take as long as it wants.


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-25 20:54             ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
  2024-05-25 20:59               ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-26 08:13                 ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
@ 2024-05-26 09:15                   ` sud <[email protected]>
  2024-05-26 17:48                     ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: sud @ 2024-05-26 09:15 UTC (permalink / raw)
  To: Torsten Förtsch <[email protected]>; +Cc: pgsql-general <[email protected]>

On Sun, May 26, 2024 at 1:43 PM Torsten Förtsch <[email protected]>
wrote:

> On Sat, May 25, 2024 at 11:00 PM sud <[email protected]> wrote:
>
>>
>> But i have one question here , does max_standby_streaming_delay = 14 ,
>> means the queries on the standby will get cancelled after 14 seconds?
>>
>
> No, your query gets cancelled when it stalls replication for >14 sec. If
> your master is idle and does not send any WAL and the replica has
> caught up, the query can take as long as it wants.
>

Thank you so much.
For example , in below scenario,
if i have insert query going on on primary instance on table  25th may
partition of TABLE1, and at same time we are selecting data from 24th May
partition , then with "max_standby_streaming_delay = 14" setup , it just
allows the select query to run for any duration without any restriction
even if the WAL gets applied on the standby regularly. Also INSERT query in
primary won't make the standby SELECT queries to cancel as because the WAL
record of INSERT queries on the primary instance is not conflicting to the
exact rows those were being read by the standby. Is my understanding
correct here?

However, if i have Update/Delete query going on on primary instance on
table  25th may partition of TABLE1 and on the exact same set of rows which
were being read by the standby instance by the SELECT query, then the
application of such WAL record to standby can max wait for 14 seconds and
thus those select query are prone to be cancelled after 14 seconds. Is this
understanding correct?

If the above is true then it doesn't look good, as because in an OLTP
system there will be a lot of DMLS happening on the writer instances and
there may be many queries running on the reader/standby instances which are
meant to run for hours. And if we say making those SELECT queries run for
hours means compromising an hour of "high availability"/RPO or a lag of an
hour between primary and standby , that doesn't look good. Please
correct me if I am missing something here.

^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-25 20:54             ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
  2024-05-25 20:59               ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-26 08:13                 ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
  2024-05-26 09:15                   ` Re: Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-26 17:48                     ` Torsten Förtsch <[email protected]>
  2024-05-26 18:46                       ` Re: Long running query causing XID limit breach sud <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Torsten Förtsch @ 2024-05-26 17:48 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: pgsql-general <[email protected]>

On Sun, May 26, 2024 at 11:16 AM sud <[email protected]> wrote:

> On Sun, May 26, 2024 at 1:43 PM Torsten Förtsch <[email protected]>
> wrote:
>
>> On Sat, May 25, 2024 at 11:00 PM sud <[email protected]> wrote:
>>
>>>
>>> But i have one question here , does max_standby_streaming_delay = 14 ,
>>> means the queries on the standby will get cancelled after 14 seconds?
>>>
>>
>> No, your query gets cancelled when it stalls replication for >14 sec. If
>> your master is idle and does not send any WAL and the replica has
>> caught up, the query can take as long as it wants.
>>
>
> Thank you so much.
> For example , in below scenario,
> if i have insert query going on on primary instance on table  25th may
> partition of TABLE1, and at same time we are selecting data from 24th May
> partition , then with "max_standby_streaming_delay = 14" setup , it just
> allows the select query to run for any duration without any restriction
> even if the WAL gets applied on the standby regularly. Also INSERT query in
> primary won't make the standby SELECT queries to cancel as because the WAL
> record of INSERT queries on the primary instance is not conflicting to the
> exact rows those were being read by the standby. Is my understanding
> correct here?
>
> However, if i have Update/Delete query going on on primary instance on
> table  25th may partition of TABLE1 and on the exact same set of rows which
> were being read by the standby instance by the SELECT query, then the
> application of such WAL record to standby can max wait for 14 seconds and
> thus those select query are prone to be cancelled after 14 seconds. Is this
> understanding correct?
>
> If the above is true then it doesn't look good, as because in an OLTP
> system there will be a lot of DMLS happening on the writer instances and
> there may be many queries running on the reader/standby instances which are
> meant to run for hours. And if we say making those SELECT queries run for
> hours means compromising an hour of "high availability"/RPO or a lag of an
> hour between primary and standby , that doesn't look good. Please
> correct me if I am missing something here.
>

Each query on the replica has a backend_xmin. You can see that in
pg_stat_activity. From that backend's perspective, tuples marked as deleted
by any transaction greater or equal to backend_xmin are still needed. This
does not depend on the table.

Now, vacuum writes to the WAL up to which point it has vacuumed on the
master. In pg_waldump this looks like so:

PRUNE snapshotConflictHorizon: 774, nredirected: 0, ndead: 5, nunused: 0,
redirected: [], dead: [2, 4, 6, 8, 10], unused: [], blkref #0: rel
1663/5/16430 blk 0

That snapshotConflictHorizon is also a transaction id. If the backend_xmin
of all backends running transactions in the same database (the 5 in 16
63/5/16430) -as the vacuum WAL record is greater than vacuum's
snapshotConflictHorizon, then there is no conflict. If any of the
backend_xmin's is less, then there is a conflict.

This type of conflict is determined by just 2 numbers, the conflict horizon
sent by the master in the WAL, and the minimum of all backend_xmins. For
your case this means a long running transaction querying table t1 might
have a backend_xmin of 223. On the master update and delete operations
happen on table T2. Since all the transactions on the master are fast, when
vacuum hits T2, the minimum of all backend_xmins on the master might
already be 425. So, garbage left over by all transactions up to 424 can be
cleaned up. Now that cleanup record reaches the replica. It compares
223>425 which is false. So, there is a conflict. Now the replica can wait
until its own horizon reaches 425 or it can kill all backends with a lower
backend_xmin.

As I understand, hot_standby_feedback does not work for you. Not sure if
you can run the query on the master? That would resolve the issues but
might generate the same bloat on the master as hot_standby_feedback.
Another option I can see is to run long running queries on a dedicated
replica with max_standby_streaming_delay set to infinity or something large
enough. If you go that way, you could also fetch the WAL from your
WAL archive instead of replicating from the master. That way the replica
has absolutely no chance to affect the master.

Good Luck!
Torsten

^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-25 20:54             ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
  2024-05-25 20:59               ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-26 08:13                 ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
  2024-05-26 09:15                   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-26 17:48                     ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
@ 2024-05-26 18:46                       ` sud <[email protected]>
  2024-05-26 19:25                         ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: sud @ 2024-05-26 18:46 UTC (permalink / raw)
  To: Torsten Förtsch <[email protected]>; +Cc: pgsql-general <[email protected]>

On Sun, May 26, 2024 at 11:18 PM Torsten Förtsch <[email protected]>
wrote:

> Each query on the replica has a backend_xmin. You can see that in
> pg_stat_activity. From that backend's perspective, tuples marked as deleted
> by any transaction greater or equal to backend_xmin are still needed. This
> does not depend on the table.
>
> Now, vacuum writes to the WAL up to which point it has vacuumed on the
> master. In pg_waldump this looks like so:
>
> PRUNE snapshotConflictHorizon: 774, nredirected: 0, ndead: 5, nunused: 0,
> redirected: [], dead: [2, 4, 6, 8, 10], unused: [], blkref #0: rel
> 1663/5/16430 blk 0
>
> That snapshotConflictHorizon is also a transaction id. If the backend_xmin
> of all backends running transactions in the same database (the 5 in 16
> 63/5/16430) -as the vacuum WAL record is greater than vacuum's
> snapshotConflictHorizon, then there is no conflict. If any of the
> backend_xmin's is less, then there is a conflict.
>
> This type of conflict is determined by just 2 numbers, the conflict
> horizon sent by the master in the WAL, and the minimum of all
> backend_xmins. For your case this means a long running transaction querying
> table t1 might have a backend_xmin of 223. On the master update and delete
> operations happen on table T2. Since all the transactions on the master are
> fast, when vacuum hits T2, the minimum of all backend_xmins on the master
> might already be 425. So, garbage left over by all transactions up to 424
> can be cleaned up. Now that cleanup record reaches the replica. It compares
> 223>425 which is false. So, there is a conflict. Now the replica can wait
> until its own horizon reaches 425 or it can kill all backends with a lower
> backend_xmin.
>
> As I understand, hot_standby_feedback does not work for you. Not sure if
> you can run the query on the master? That would resolve the issues but
> might generate the same bloat on the master as hot_standby_feedback.
> Another option I can see is to run long running queries on a dedicated
> replica with max_standby_streaming_delay set to infinity or something large
> enough. If you go that way, you could also fetch the WAL from your
> WAL archive instead of replicating from the master. That way the replica
> has absolutely no chance to affect the master.
>
>
Thank you so much.

Would you agree that we should have two standby, one with default
max_standby_streaming_delay (say 10 sec ) which will be mainly used as high
availability and thus will be having minimal lag. and another standby with
max_standby_streaming_delay as "-1" i.e. it will wait indefinitely for the
SELECT queries to finish without caring about the lag, which will be
utilized for the long running SELECT queries.

And keep the hot_standby_feedback as ON for the first standby which is used
as HA/high availability. And keep the hot_standby_feedback as OFF for the
second standby which is utilized for long running SELECT queries, so that
primary won't be waiting for the response/feedback from this standby to
vacuum its old transactions and that will keep the transaction id wrap
around issue from not happening because of the Read/Select queries on any
of the standby.


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-25 20:54             ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
  2024-05-25 20:59               ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-26 08:13                 ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
  2024-05-26 09:15                   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-26 17:48                     ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
  2024-05-26 18:46                       ` Re: Long running query causing XID limit breach sud <[email protected]>
@ 2024-05-26 19:25                         ` Torsten Förtsch <[email protected]>
  2024-05-27 03:37                           ` Re: Long running query causing XID limit breach sud <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Torsten Förtsch @ 2024-05-26 19:25 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: pgsql-general <[email protected]>

On Sun, May 26, 2024 at 8:46 PM sud <[email protected]> wrote:

> Would you agree that we should have two standby, one with default
> max_standby_streaming_delay (say 10 sec ) which will be mainly used as high
> availability and thus will be having minimal lag. and another standby with
> max_standby_streaming_delay as "-1" i.e. it will wait indefinitely for the
> SELECT queries to finish without caring about the lag, which will be
> utilized for the long running SELECT queries.
>
> And keep the hot_standby_feedback as ON for the first standby which is
> used as HA/high availability. And keep the hot_standby_feedback as OFF for
> the second standby which is utilized for long running SELECT queries, so
> that primary won't be waiting for the response/feedback from this standby
> to vacuum its old transactions and that will keep the transaction id wrap
> around issue from not happening because of the Read/Select queries on any
> of the standby.
>

Sure. That could work. Perhaps also set statement_timeout on the first
replica, just in case.


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-25 20:54             ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
  2024-05-25 20:59               ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-26 08:13                 ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
  2024-05-26 09:15                   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-26 17:48                     ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
  2024-05-26 18:46                       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-26 19:25                         ` Re: Long running query causing XID limit breach Torsten Förtsch <[email protected]>
@ 2024-05-27 03:37                           ` sud <[email protected]>
  0 siblings, 0 replies; 24+ messages in thread

From: sud @ 2024-05-27 03:37 UTC (permalink / raw)
  To: Torsten Förtsch <[email protected]>; +Cc: pgsql-general <[email protected]>

On Mon, May 27, 2024 at 12:55 AM Torsten Förtsch <[email protected]>
wrote:

> On Sun, May 26, 2024 at 8:46 PM sud <[email protected]> wrote:
>
>> Would you agree that we should have two standby, one with default
>> max_standby_streaming_delay (say 10 sec ) which will be mainly used as high
>> availability and thus will be having minimal lag. and another standby with
>> max_standby_streaming_delay as "-1" i.e. it will wait indefinitely for the
>> SELECT queries to finish without caring about the lag, which will be
>> utilized for the long running SELECT queries.
>>
>> And keep the hot_standby_feedback as ON for the first standby which is
>> used as HA/high availability. And keep the hot_standby_feedback as OFF for
>> the second standby which is utilized for long running SELECT queries, so
>> that primary won't be waiting for the response/feedback from this standby
>> to vacuum its old transactions and that will keep the transaction id wrap
>> around issue from not happening because of the Read/Select queries on any
>> of the standby.
>>
>
> Sure. That could work. Perhaps also set statement_timeout on the first
> replica, just in case.
>

Thank you so much. Yes, planning to set it like below. Hope i am doing it
correctly.

Master/Primary First Replica/Standby for High Availability Second Replica
for Reporting
hot_standby_feedback=ON hot_standby_feedback=ON hot_standby_feedback=OFF
max_standby_streaming_delay=10 sec max_standby_streaming_delay=10 sec
max_standby_streaming_delay=-1
(Infinite)
statement_timeout = "2hrs" statement_timeout="2hrs" No statement_timeout
i.e. infinite
idle_in_transaction_session_timeout=10minutes
idle_in_transaction_session_timeout=10minutes No
idle_in_transaction_session_timeout i.e. infinite
autovacuum_freeze_max_age=100M autovacuum_freeze_max_age=100M
autovacuum_freeze_max_age=100M
Log_autovacuum_min_duration=0 Log_autovacuum_min_duration=0
Log_autovacuum_min_duration=0


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-25 20:54             ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
@ 2024-05-27 09:20               ` Laurenz Albe <[email protected]>
  2024-05-28 20:04                 ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: Laurenz Albe @ 2024-05-27 09:20 UTC (permalink / raw)
  To: yudhi s <[email protected]>; sud <[email protected]>; +Cc: pgsql-general <[email protected]>

> On Fri, May 24, 2024 at 10:34 AM sud <[email protected]> wrote:
> > I am trying to understand these two parameters and each time it looks a bit confusing to me.
> > If These two parameters complement or conflict with each other.
> > 
> > Say for example, If we set hot_feedback_standby to ON (which is currently set as
> > default ON by the way), it will make the primary wait till the query completion at
> > standby and can cause such a high bump in XID in scenarios where the query on standby
> > runs for days(like in our current scenario which happens). So we were thinking of
> > setting it as OFF, to avoid the transaction ID wrap around issue..

"hot_standby_feedback" is not "on" by default; you must have changed it.

This parameter will not make anything wait.  The effect is that VACUUM on the
primary server won't remove any old row versions that are still needed by a
query on the standby.  This holds back the "xmin" horizon and can lead to
bloat and, if you consume transaction IDs quickly, to wraparound problems.

> > But as you also mentioned to set the "max_standby_streaming_delay" to -1 (which is
> > currently set as 14 second in our case) ,it will wait infinitely , till the query
> > completes on the standby and wont apply the WAL which can cause override of the XID
> > which the standby query is reading from. But wont this same behaviour be happening
> > while we have hot_feedback_standby set as "ON"?

"It" is the standby, where you set the parameter.  The primary is not affected by that
at all.  If there is a replication conflict on the standby, replay of the WAL information
is delayed until the query is done.

There is no conflict between these settings; they do something different.

> > But again for HA , in case primary down we should not be in big lag for the standby
> > and thus we want the standby also with minimal lag. And as you mentioned there will
> > never be incorrect results but at amx it will be query cancellation, so I was thinking,
> > if it's fine to just keep the "hot_feedback_standby" as OFF and let the
> > max_standby_streaming_delay set as it is like 14 sec. Let me know your thoughts.
> 

You cannot have it.
Let me repeat: you cannot have it.

The only way you can have no delay in replication AND no canceled queries is
if you use two different standby servers with different settings for
"max_standby_streaming_delay".  One of the server is for HA, the other for
your long-running queries.

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Long running query causing XID limit breach
  2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 07:52 ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 08:11   ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 08:15     ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-23 12:45       ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-23 14:41         ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
  2024-05-24 05:04           ` Re: Long running query causing XID limit breach sud <[email protected]>
  2024-05-25 20:54             ` Re: Long running query causing XID limit breach yudhi s <[email protected]>
  2024-05-27 09:20               ` Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
@ 2024-05-28 20:04                 ` yudhi s <[email protected]>
  0 siblings, 0 replies; 24+ messages in thread

From: yudhi s @ 2024-05-28 20:04 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: sud <[email protected]>; pgsql-general <[email protected]>

On Mon, May 27, 2024 at 2:50 PM Laurenz Albe <[email protected]>
wrote:

>
> > > But again for HA , in case primary down we should not be in big lag
> for the standby
> > > and thus we want the standby also with minimal lag. And as you
> mentioned there will
> > > never be incorrect results but at amx it will be query cancellation,
> so I was thinking,
> > > if it's fine to just keep the "hot_feedback_standby" as OFF and let the
> > > max_standby_streaming_delay set as it is like 14 sec. Let me know your
> thoughts.
> >
> You cannot have it.
> Let me repeat: you cannot have it.
>
> The only way you can have no delay in replication AND no canceled queries
> is
> if you use two different standby servers with different settings for
> "max_standby_streaming_delay".  One of the server is for HA, the other for
> your long-running queries.
>
> When you suggest having different max_standby_streaming_delay for first
replica (say 10 sec for High availability) and second replica(say -1 for
long running queries). Do you also suggest  keeping "hot_feedback_standby"
as "OFF" for all the three instances i.e. master and both the replicas?

Also OP has added a few other parameters as below, do you think these
should be needed?
 I think the master and first replica should have the same set up because
in case of any disaster to master the first replica should be able to take
the place of master.

Master/Primary First Replica/Standby for High Availability Second Replica
for Reporting
hot_standby_feedback=ON hot_standby_feedback=ON hot_standby_feedback=OFF
max_standby_streaming_delay=10 sec max_standby_streaming_delay=10 sec
max_standby_streaming_delay=-1
(Infinite)
statement_timeout = "2hrs" statement_timeout="2hrs" No statement_timeout
i.e. infinite
idle_in_transaction_session_timeout=10minutes
idle_in_transaction_session_timeout=10minutes No
idle_in_transaction_session_timeout i.e. infinite
autovacuum_freeze_max_age=100M autovacuum_freeze_max_age=100M
autovacuum_freeze_max_age=100M
Log_autovacuum_min_duration=0 Log_autovacuum_min_duration=0
Log_autovacuum_min_duration=0


^ permalink  raw  reply  [nested|flat] 24+ messages in thread

end of thread, other threads:[~2024-05-28 20:04 UTC | newest]

Thread overview: 24+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-05-22 21:16 Long running query causing XID limit breach sud <[email protected]>
2024-05-23 03:29 ` Muhammad Salahuddin Manzoor <[email protected]>
2024-05-23 04:48   ` sud <[email protected]>
2024-05-23 05:12     ` Muhammad Salahuddin Manzoor <[email protected]>
2024-05-23 06:11       ` sud <[email protected]>
2024-05-23 06:25         ` sud <[email protected]>
2024-05-23 06:51           ` Muhammad Salahuddin Manzoor <[email protected]>
2024-05-23 07:19         ` yudhi s <[email protected]>
2024-05-23 07:52 ` Laurenz Albe <[email protected]>
2024-05-23 08:11   ` sud <[email protected]>
2024-05-23 08:15     ` Laurenz Albe <[email protected]>
2024-05-23 12:45       ` sud <[email protected]>
2024-05-23 14:41         ` Laurenz Albe <[email protected]>
2024-05-24 05:04           ` sud <[email protected]>
2024-05-25 20:54             ` yudhi s <[email protected]>
2024-05-25 20:59               ` sud <[email protected]>
2024-05-26 08:13                 ` Torsten Förtsch <[email protected]>
2024-05-26 09:15                   ` sud <[email protected]>
2024-05-26 17:48                     ` Torsten Förtsch <[email protected]>
2024-05-26 18:46                       ` sud <[email protected]>
2024-05-26 19:25                         ` Torsten Förtsch <[email protected]>
2024-05-27 03:37                           ` sud <[email protected]>
2024-05-27 09:20               ` Laurenz Albe <[email protected]>
2024-05-28 20:04                 ` yudhi s <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox