Re: Long running query causing XID limit breach

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Long running query causing XID limit breach
8+ messages / 4 participants
[nested] [flat]

* Re: Long running query causing XID limit breach
@ 2024-05-29 08:08  Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: Laurenz Albe @ 2024-05-29 08:08 UTC (permalink / raw)
  To: yudhi s <[email protected]>; +Cc: sud <[email protected]>; pgsql-general <[email protected]>

On Wed, 2024-05-29 at 01:34 +0530, yudhi s wrote:
> > The only way you can have no delay in replication AND no canceled queries is
> > if you use two different standby servers with different settings for
> > "max_standby_streaming_delay".  One of the server is for HA, the other for
> > your long-running queries.
>
> When you suggest having different max_standby_streaming_delay for first replica
> (say 10 sec for High availability) and second replica(say -1 for long running queries).
> Do you also suggest  keeping "hot_feedback_standby" as "OFF" for all the three
> instances i.e. master and both the replicas?

The parameter is ignored on the master.
It needs to be off on the standby that is running long queries.
For the other standby it probably doesn't matter if you are not running any
queries on it.  I would leave "hot_standby_feedback = off" there as well.

Actually, I would set "hot_standby = off" on the standby that is only used
for HA.

> Also OP has added a few other parameters as below, do you think these should be needed? 
>  I think the master and first replica should have the same set up because in case
> of any disaster to master the first replica should be able to take the place of master.
>  
> Master/Primary				First Replica/Standby for High Availability	Second Replica for Reporting
> hot_standby_feedback=ON			hot_standby_feedback=ON				hot_standby_feedback=OFF
> max_standby_streaming_delay=10 sec		max_standby_streaming_delay=10 sec		max_standby_streaming_delay=-1 (Infinite)
> statement_timeout = "2hrs"			statement_timeout="2hrs"			No statement_timeout i.e. infinite
> idle_in_transaction_session_timeout=10minutes	idle_in_transaction_session_timeout=10minutes	No idle_in_transaction_session_timeout i.e. infinite
> autovacuum_freeze_max_age=100M		autovacuum_freeze_max_age=100M			autovacuum_freeze_max_age=100M
> Log_autovacuum_min_duration=0			Log_autovacuum_min_duration=0			Log_autovacuum_min_duration=0

- I would leave "hot_standby_feedback" off everywhere.
- "max_standby_streaming_delay" should be -1 on the reporting standby and very
  low or 0 on the HA standby. It doesn't matter on the primary.
- "statement_timeout" should be way lower on the first two nodes.
- "idle_in_transaction_session_timeout" is good.
- I would leave "autovacuum_freeze_max_age" at the default setting but 100 million
  is ok too.

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 8+ messages in thread

* Re: Long running query causing XID limit breach
@ 2024-06-05 06:25  sud <[email protected]>
  parent: Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: sud @ 2024-06-05 06:25 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: yudhi s <[email protected]>; pgsql-general <[email protected]>

Hello Laurenz,

Thank you so much.This information was really helpful for us
understanding the working of these parameters.

One follow up question i have , as we are setting one of the
standby/replica with value idle_in_transaction_session_timeout=-1 which can
cause the WAL's to be heavily backlogged in a scenario where we have a
query running for very long time on that instance. So in that case will
there be chances of instance restart and if that can be avoided anyway?

And the plan is to set these system parameters with different values in
writer/read replica , so in that case if we apply the "alter system"
command on the primary , won't the WAL going to apply those same commands
forcibly on reader instance making those same as the writer instance
configuration( but we want the reader replica configuration to be different
from writer)?

Appreciate your guidance.

On Wed, May 29, 2024 at 1:38 PM Laurenz Albe <[email protected]>
wrote:

> On Wed, 2024-05-29 at 01:34 +0530, yudhi s wrote:
> > > The only way you can have no delay in replication AND no canceled
> queries is
> > > if you use two different standby servers with different settings for
> > > "max_standby_streaming_delay".  One of the server is for HA, the other
> for
> > > your long-running queries.
> >
> > When you suggest having different max_standby_streaming_delay for first
> replica
> > (say 10 sec for High availability) and second replica(say -1 for long
> running queries).
> > Do you also suggest  keeping "hot_feedback_standby" as "OFF" for all the
> three
> > instances i.e. master and both the replicas?
>
> The parameter is ignored on the master.
> It needs to be off on the standby that is running long queries.
> For the other standby it probably doesn't matter if you are not running any
> queries on it.  I would leave "hot_standby_feedback = off" there as well.
>
> Actually, I would set "hot_standby = off" on the standby that is only used
> for HA.
>
>
> - I would leave "hot_standby_feedback" off everywhere.
> - "max_standby_streaming_delay" should be -1 on the reporting standby and
> very
>   low or 0 on the HA standby. It doesn't matter on the primary.
> - "statement_timeout" should be way lower on the first two nodes.
> - "idle_in_transaction_session_timeout" is good.
> - I would leave "autovacuum_freeze_max_age" at the default setting but 100
> million
>   is ok too.
>
> Yours,
> Laurenz Albe
>

^ permalink  raw  reply  [nested|flat] 8+ messages in thread

* Re: Long running query causing XID limit breach
@ 2024-06-05 07:08  Simon Elbaz <[email protected]>
  parent: sud <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: Simon Elbaz @ 2024-06-05 07:08 UTC (permalink / raw)
  To: sud <[email protected]>; +Cc: Laurenz Albe <[email protected]>; yudhi s <[email protected]>; pgsql-general <[email protected]>

Hi,

I am following this very interesting thread.

From the documentation
https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-IDLE-IN-TRANSACTION-SESSION-T...,
the 0 value will disable the timeout (not -1).



On Wed, Jun 5, 2024 at 8:25 AM sud <[email protected]> wrote:

> Hello Laurenz,
>
> Thank you so much.This information was really helpful for us
> understanding the working of these parameters.
>
> One follow up question i have , as we are setting one of the
> standby/replica with value idle_in_transaction_session_timeout=-1 which can
> cause the WAL's to be heavily backlogged in a scenario where we have a
> query running for very long time on that instance. So in that case will
> there be chances of instance restart and if that can be avoided anyway?
>
> And the plan is to set these system parameters with different values in
> writer/read replica , so in that case if we apply the "alter system"
> command on the primary , won't the WAL going to apply those same commands
> forcibly on reader instance making those same as the writer instance
> configuration( but we want the reader replica configuration to be different
> from writer)?
>
> Appreciate your guidance.
>
> On Wed, May 29, 2024 at 1:38 PM Laurenz Albe <[email protected]>
> wrote:
>
>> On Wed, 2024-05-29 at 01:34 +0530, yudhi s wrote:
>> > > The only way you can have no delay in replication AND no canceled
>> queries is
>> > > if you use two different standby servers with different settings for
>> > > "max_standby_streaming_delay".  One of the server is for HA, the
>> other for
>> > > your long-running queries.
>> >
>> > When you suggest having different max_standby_streaming_delay for first
>> replica
>> > (say 10 sec for High availability) and second replica(say -1 for long
>> running queries).
>> > Do you also suggest  keeping "hot_feedback_standby" as "OFF" for all
>> the three
>> > instances i.e. master and both the replicas?
>>
>> The parameter is ignored on the master.
>> It needs to be off on the standby that is running long queries.
>> For the other standby it probably doesn't matter if you are not running
>> any
>> queries on it.  I would leave "hot_standby_feedback = off" there as well.
>>
>> Actually, I would set "hot_standby = off" on the standby that is only used
>> for HA.
>>
>>
>> - I would leave "hot_standby_feedback" off everywhere.
>> - "max_standby_streaming_delay" should be -1 on the reporting standby and
>> very
>>   low or 0 on the HA standby. It doesn't matter on the primary.
>> - "statement_timeout" should be way lower on the first two nodes.
>> - "idle_in_transaction_session_timeout" is good.
>> - I would leave "autovacuum_freeze_max_age" at the default setting but
>> 100 million
>>   is ok too.
>>
>> Yours,
>> Laurenz Albe
>>
>


^ permalink  raw  reply  [nested|flat] 8+ messages in thread

* Re: Long running query causing XID limit breach
@ 2024-06-05 07:39  sud <[email protected]>
  parent: Simon Elbaz <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: sud @ 2024-06-05 07:39 UTC (permalink / raw)
  To: Simon Elbaz <[email protected]>; Laurenz Albe <[email protected]>; +Cc: yudhi s <[email protected]>; pgsql-general <[email protected]>

On Wed, 5 Jun, 2024, 12:39 pm Simon Elbaz, <[email protected]> wrote:

> Hi,
>
> I am following this very interesting thread.
>
> From the documentation
> https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-IDLE-IN-TRANSACTION-SESSION-T...,
> the 0 value will disable the timeout (not -1).
>
>
>
> On Wed, Jun 5, 2024 at 8:25 AM sud <[email protected]> wrote:
>
>> Hello Laurenz,
>>
>> Thank you so much.This information was really helpful for us
>> understanding the working of these parameters.
>>
>> One follow up question i have , as we are setting one of the
>> standby/replica with value idle_in_transaction_session_timeout=-1 which can
>> cause the WAL's to be heavily backlogged in a scenario where we have a
>> query running for very long time on that instance. So in that case will
>> there be chances of instance restart and if that can be avoided anyway?
>>
>> And the plan is to set these system parameters with different values in
>> writer/read replica , so in that case if we apply the "alter system"
>> command on the primary , won't the WAL going to apply those same commands
>> forcibly on reader instance making those same as the writer instance
>> configuration( but we want the reader replica configuration to be different
>> from writer)?
>>
>> Appreciate your guidance.
>>
>>
>
My apologies. I was meant to say setting up "max_standby_streaming_delay"
To -1. Which means unlimited lag.

>


^ permalink  raw  reply  [nested|flat] 8+ messages in thread

* Re: Long running query causing XID limit breach
@ 2024-06-05 10:22  Laurenz Albe <[email protected]>
  parent: sud <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: Laurenz Albe @ 2024-06-05 10:22 UTC (permalink / raw)
  To: sud <[email protected]>; Simon Elbaz <[email protected]>; +Cc: yudhi s <[email protected]>; pgsql-general <[email protected]>

On Wed, 2024-06-05 at 13:09 +0530, sud wrote:
> > > One follow up question i have , as we are setting one of the standby/replica
> > > with value idle_in_transaction_session_timeout=-1 which can cause the WAL's
> > > to be heavily backlogged in a scenario where we have a query running for very
> > > long time on that instance. So in that case will there be chances of instance
> > > restart and if that can be avoided anyway?
> 
> My apologies. I was meant to say setting up "max_standby_streaming_delay" To -1.
> Which means unlimited lag. 

There should never be a restart unless you perform one or the standby crashes.
If you mean that you want to avoid a crash caused by a full disk on the standby,
the answer is probably "no".  Make sure that you have enough disk space and
use monitoring.

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 8+ messages in thread

* Re: Long running query causing XID limit breach
@ 2024-06-05 19:21  yudhi s <[email protected]>
  parent: Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: yudhi s @ 2024-06-05 19:21 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: sud <[email protected]>; Simon Elbaz <[email protected]>; pgsql-general <[email protected]>

On Wed, Jun 5, 2024 at 3:52 PM Laurenz Albe <[email protected]>
wrote:

>
> There should never be a restart unless you perform one or the standby
> crashes.
> If you mean that you want to avoid a crash caused by a full disk on the
> standby,
> the answer is probably "no".  Make sure that you have enough disk space and
> use monitoring.
>
> Yours,
> Laurenz Albe
>

Is this because OP initially mentioned its RDS postgres, so in that case
there is storage space restriction on 64TB(and 128TB in case of aurora
postgres). So I believe this storage space combines data + WAL , so in that
case as you mentioned, appropriate monitoring needs to be put in place.
Or else in the worst case scenario, if the storage consumption hit that
hard limit , then there will be instance restart or crash?

^ permalink  raw  reply  [nested|flat] 8+ messages in thread

* Re: Long running query causing XID limit breach
@ 2024-06-08 09:21  sud <[email protected]>
  parent: yudhi s <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: sud @ 2024-06-08 09:21 UTC (permalink / raw)
  To: yudhi s <[email protected]>; +Cc: Laurenz Albe <[email protected]>; Simon Elbaz <[email protected]>; pgsql-general <[email protected]>

On Thu, Jun 6, 2024 at 12:52 AM yudhi s <[email protected]> wrote:

> On Wed, Jun 5, 2024 at 3:52 PM Laurenz Albe <[email protected]>
> wrote:
>
>>
>> There should never be a restart unless you perform one or the standby
>> crashes.
>> If you mean that you want to avoid a crash caused by a full disk on the
>> standby,
>> the answer is probably "no".  Make sure that you have enough disk space
>> and
>> use monitoring.
>>
>> Yours,
>> Laurenz Albe
>>
>
> Is this because OP initially mentioned its RDS postgres, so in that case
> there is storage space restriction on 64TB(and 128TB in case of aurora
> postgres). So I believe this storage space combines data + WAL , so in that
> case as you mentioned, appropriate monitoring needs to be put in place.
> Or else in the worst case scenario, if the storage consumption hit that
> hard limit , then there will be instance restart or crash?
>

Thank You so much Laurenz and Yudhi.

Yes its RDS and as you mentioned there does exist a space limitation of
~64TB but as Laurenz mentioned the only time the second standby may crash
would be probably because of  the storage space saturation and thus we need
to have appropriate monitoring in place to find this and get alerted
beforehand. And also a monitoring to see how much WAL gets generated per
hour/day to get an idea of the usage. I am not sure how to do it , but will
check on this.

^ permalink  raw  reply  [nested|flat] 8+ messages in thread

* Re: Long running query causing XID limit breach
@ 2024-06-13 07:43  yudhi s <[email protected]>
  parent: sud <[email protected]>
  0 siblings, 0 replies; 8+ messages in thread

From: yudhi s @ 2024-06-13 07:43 UTC (permalink / raw)
  To: sud <[email protected]>; Laurenz Albe <[email protected]>; +Cc: Simon Elbaz <[email protected]>; pgsql-general <[email protected]>

On Sat, Jun 8, 2024 at 2:51 PM sud <[email protected]> wrote:

>
> Thank You so much Laurenz and Yudhi.
>
> Yes its RDS and as you mentioned there does exist a space limitation of
> ~64TB but as Laurenz mentioned the only time the second standby may crash
> would be probably because of  the storage space saturation and thus we need
> to have appropriate monitoring in place to find this and get alerted
> beforehand. And also a monitoring to see how much WAL gets generated per
> hour/day to get an idea of the usage. I am not sure how to do it , but will
> check on this.
>

Not exactly related but just for our information, While going through the
"aurora postgres" database docs in regards to similar concepts which are
getting discussed here, I am finding some interesting stuff.

https://aws.amazon.com/blogs/database/manage-long-running-read-queries-on-amazon-aurora-postgresql-c...

*Cancel the conflicting query on the reader node if the conflict lasts
longer than max_standby_streaming_delay (maximum 30 seconds). This is
different from Amazon RDS or self-managed PostgreSQL. With Amazon RDS or
self-managed PostgreSQL, the instance has its own physical copy of the
database, and you’re able to set the parameter max_standby_streaming_delay
as high as you want to prevent query cancellation.If the conflicting query
can’t cancel in time, or if multiple long-running queries are causing the
replication lag to go beyond 60 seconds, Aurora restarts the reader node to
ensure it’s not lagging far behind the primary node.*

So if i get it correct it means, even if hot_standby_feedback is set to OFF,
the constraints of max_standby_streaming_delay (30 seconds) and the
60-second replication lag limit applies. And thus Aurora may cancel
long-running queries or restart reader nodes to maintain synchronization
even if it just runs for >60seconds.  So it's really odd but does that mean
, by no way you can guarantee a query to run >60 seconds on read replica in
aurora postgres?

^ permalink  raw  reply  [nested|flat] 8+ messages in thread

end of thread, other threads:[~2024-06-13 07:43 UTC | newest]

Thread overview: 8+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-05-29 08:08 Re: Long running query causing XID limit breach Laurenz Albe <[email protected]>
2024-06-05 06:25 ` sud <[email protected]>
2024-06-05 07:08   ` Simon Elbaz <[email protected]>
2024-06-05 07:39     ` sud <[email protected]>
2024-06-05 10:22       ` Laurenz Albe <[email protected]>
2024-06-05 19:21         ` yudhi s <[email protected]>
2024-06-08 09:21           ` sud <[email protected]>
2024-06-13 07:43             ` yudhi s <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox