public inbox for [email protected]  
help / color / mirror / Atom feed
Proposal: recent access based routing for primary-replica setups
61+ messages / 2 participants
[nested] [flat]

* Proposal: recent access based routing for primary-replica setups
@ 2025-08-17 13:27  Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-08-17 13:27 UTC (permalink / raw)
  To: [email protected]

Hello all,

My name is Nadav Shatz, I’m the CTO at Tailor Brands and have been working
with PostgreSQL in high-traffic, distributed environments for many years.
Most of my focus has been on backend architecture, scaling, and performance
optimization, and I’m a long-time user and admirer of the Postgres
ecosystem.

I’d like to propose adding a feature to pgpool-II for *recent access based
routing* in primary-replica setups. The idea is similar to what we’ve
described in this article
<https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a;,
and is also reflected in this pgcat PR
<https://github.com/postgresml/pgcat/pull/864;. The core concept is to
route read queries to the primary if they occur shortly after a write,
reducing replica lag inconsistencies while still benefiting from read
scaling.

*How it would work (high-level):*


   -


*External “effective lag” via config (hot-reloaded): *Instead of relying on
   pgpool-II’s replication delay checks (which don’t map well to Aurora
   semantics), we’ll expose a *config value* representing the effective
   replica lag (or directly the TTL to use for “recency”). This value
is *pushed
   by an external controller* and *hot-reloaded* (no restarts). The
   relevant knobs might look like:

   -

      enable_recent_access_routing (boolean, default off)
      -

      recent_access_ttl_ms (integer, default 0, can be hot-reloaded)
      -

      enable_query_parser (boolean, required for this feature, default off)

   -


*In-memory recent-access map: *Each worker maintains a lightweight per-DB
   in-memory map of *recently written relations*. On any write
   (INSERT/UPDATE/DELETE/UPSERT/TRUNCATE), we record the touched relations
   with a TTL derived from recent_access_ttl_ms. Entries expire
   automatically; writes refresh them.
   -


*Routing + query parsing: *For incoming statements we parse enough to
   answer two questions: (1) is it a read or a write? and (2) which relations
   are referenced? If a read touches any “recently written” relation, we *force
   route to primary*; otherwise we allow normal read load-balancing to
   replicas.

*Notes on behavior & ops:*


   -

   *Config & hot reload:* Operators (or an external controller) can update
   recent_access_ttl_ms dynamically and trigger hot reload to adapt to
   changing conditions—no reliance on Aurora internals.
   -

   *Safety levers:* a global max TTL, optional allow/deny lists, and
   metrics (e.g., “reads forced to primary due to recency”) for visibility.
   -

   *Defaults & compatibility:* all defaults are safe/off; enabling requires
   explicit opt-in.

I’ll prepare the code changes and send a patch/PR, but before diving in I
wanted to check if anyone has *objections, concerns, or preferred
alternatives*—particularly around parser hooks, shared memory use, or
hot-reload mechanics in pgpool-II.

Thanks for considering,
-- 
Nadav Shatz
Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-18 12:51  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-08-18 12:51 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hello Nadav,

Thank you for the proposal. I have a few questions.

> Hello all,
> 
> My name is Nadav Shatz, I’m the CTO at Tailor Brands and have been working
> with PostgreSQL in high-traffic, distributed environments for many years.
> Most of my focus has been on backend architecture, scaling, and performance
> optimization, and I’m a long-time user and admirer of the Postgres
> ecosystem.
> 
> I’d like to propose adding a feature to pgpool-II for *recent access based
> routing* in primary-replica setups. The idea is similar to what we’ve
> described in this article
> <https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a;,
> and is also reflected in this pgcat PR
> <https://github.com/postgresml/pgcat/pull/864;. The core concept is to
> route read queries to the primary if they occur shortly after a write,
> reducing replica lag inconsistencies while still benefiting from read
> scaling.
> 
> *How it would work (high-level):*
> 
> 
>    -
> 
> 
> *External “effective lag” via config (hot-reloaded): *Instead of relying on
>    pgpool-II’s replication delay checks (which don’t map well to Aurora
>    semantics), we’ll expose a *config value* representing the effective
>    replica lag (or directly the TTL to use for “recency”). This value
> is *pushed
>    by an external controller* and *hot-reloaded* (no restarts). The
>    relevant knobs might look like:
>
>    -
> 
>       enable_recent_access_routing (boolean, default off)
>       -
> 
>       recent_access_ttl_ms (integer, default 0, can be hot-reloaded)

If my understanding is correct, the "external controller" updates
"recent_access_ttl_ms" to let pgpool know the current delay of
replica. My question is, what if there are multiple replicas. In this
case the "external controller" calculates the average latency of each
replica?

Another question is, how often the external controller updates and
reload pgpool.conf. If it's like every second, probably it could give
unacceptable load to pgpool because reloading pgpool.conf is expensive
operation.

>       enable_query_parser (boolean, required for this feature, default off)

What does this do? Why do you need this?

> *In-memory recent-access map: *Each worker maintains a lightweight per-DB
>    in-memory map of *recently written relations*. On any write

Is "per-DB in-memory map" in shared memory?

>    (INSERT/UPDATE/DELETE/UPSERT/TRUNCATE), we record the touched relations
>    with a TTL derived from recent_access_ttl_ms. Entries expire
>    automatically; writes refresh them.

How do you automatically expire the entries? Are you going to
implement something like a auto sweeper process?

> *Routing + query parsing: *For incoming statements we parse enough to
>    answer two questions: (1) is it a read or a write? and (2) which relations
>    are referenced? If a read touches any “recently written” relation, we *force
>    route to primary*; otherwise we allow normal read load-balancing to
>    replicas.

Pgpool-II already does (1) and (2).

> *Notes on behavior & ops:*
> 
> 
>    -
> 
>    *Config & hot reload:* Operators (or an external controller) can update
>    recent_access_ttl_ms dynamically and trigger hot reload to adapt to
>    changing conditions―no reliance on Aurora internals.
>    -
> 
>    *Safety levers:* a global max TTL, optional allow/deny lists, and
>    metrics (e.g., “reads forced to primary due to recency”) for visibility.

Please elaborate more on this. Allow/deny what?

>    *Defaults & compatibility:* all defaults are safe/off; enabling requires
>    explicit opt-in.

Sounds good.

> I’ll prepare the code changes and send a patch/PR, but before diving in I
> wanted to check if anyone has *objections, concerns, or preferred
> alternatives*―particularly around parser hooks, shared memory use, or
> hot-reload mechanics in pgpool-II.

Probably you should consider adding a pcp command to notice pgpool the
"recent_access_ttl_ms". That is far more efficient than reloading
pgpool.conf.

> Thanks for considering,
> -- 
> Nadav Shatz
> Tailor Brands | CTO

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp





^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-18 14:11  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-08-18 14:11 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

Thank you very much for your reply and questions!
I'll try and respond to everything inline, please let me know if I missed
something or if anything isn't clear enough.

On Mon, Aug 18, 2025 at 3:51 PM Tatsuo Ishii <[email protected]> wrote:

> Hello Nadav,
>
> Thank you for the proposal. I have a few questions.
>
> > Hello all,
> >
> > My name is Nadav Shatz, I’m the CTO at Tailor Brands and have been
> working
> > with PostgreSQL in high-traffic, distributed environments for many years.
> > Most of my focus has been on backend architecture, scaling, and
> performance
> > optimization, and I’m a long-time user and admirer of the Postgres
> > ecosystem.
> >
> > I’d like to propose adding a feature to pgpool-II for *recent access
> based
> > routing* in primary-replica setups. The idea is similar to what we’ve
> > described in this article
> > <
> https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a
> >,
> > and is also reflected in this pgcat PR
> > <https://github.com/postgresml/pgcat/pull/864;. The core concept is to
> > route read queries to the primary if they occur shortly after a write,
> > reducing replica lag inconsistencies while still benefiting from read
> > scaling.
> >
> > *How it would work (high-level):*
> >
> >
> >    -
> >
> >
> > *External “effective lag” via config (hot-reloaded): *Instead of relying
> on
> >    pgpool-II’s replication delay checks (which don’t map well to Aurora
> >    semantics), we’ll expose a *config value* representing the effective
> >    replica lag (or directly the TTL to use for “recency”). This value
> > is *pushed
> >    by an external controller* and *hot-reloaded* (no restarts). The
> >    relevant knobs might look like:
> >
> >    -
> >
> >       enable_recent_access_routing (boolean, default off)
> >       -
> >
> >       recent_access_ttl_ms (integer, default 0, can be hot-reloaded)
>
> If my understanding is correct, the "external controller" updates
> "recent_access_ttl_ms" to let pgpool know the current delay of
> replica. My question is, what if there are multiple replicas. In this
> case the "external controller" calculates the average latency of each
> replica?
>
> Another question is, how often the external controller updates and
> reload pgpool.conf. If it's like every second, probably it could give
> unacceptable load to pgpool because reloading pgpool.conf is expensive
> operation.
>
>
You understood correctly - my plan was to keep it as generic as possible
and leave all logic to be handled by the external controller. Basically
leaving all of these decisions (how often to update, calculation, etc.) to
the external implementation as it can get very case specific.
This approach comes from the need of replica lag understanding under AWS
Aurora - which doesn't expose these metrics from the DB itself.
I also thought of implementing a couple of other possible mechanisms:
1. use a pcp command like you suggest below, i wasn't aware of the option,
this will handle the expensive operation but no other concerns mentioned.
2. we can implement support to using the AWS Aurora API directly for the
lag, while this is cloud provider and db "flavor" specific, it is a very
large and common use case. Doing this will open up all other pgpool
features that rely on the lag values being available. From a performance
perspective it is probably best.

>       enable_query_parser (boolean, required for this feature, default
> off)
>
> What does this do? Why do you need this?
>

this was referring to enabling the auto routing already existing in pgpool
(based on query content), the naming is wrong.
basically meant to say - if the auto routing is disabled, there is no point
in enabling the latest access based routing.
Sorry for the confusion.


>
> > *In-memory recent-access map: *Each worker maintains a lightweight per-DB
> >    in-memory map of *recently written relations*. On any write
>
> Is "per-DB in-memory map" in shared memory?
>

Yes


>
> >    (INSERT/UPDATE/DELETE/UPSERT/TRUNCATE), we record the touched
> relations
> >    with a TTL derived from recent_access_ttl_ms. Entries expire
> >    automatically; writes refresh them.
>
> How do you automatically expire the entries? Are you going to
> implement something like a auto sweeper process?
>

Great question - maybe combine that with a lazy deletion process on read.
similar to what memcached is doing.


>
> > *Routing + query parsing: *For incoming statements we parse enough to
> >    answer two questions: (1) is it a read or a write? and (2) which
> relations
> >    are referenced? If a read touches any “recently written” relation, we
> *force
> >    route to primary*; otherwise we allow normal read load-balancing to
> >    replicas.
>
> Pgpool-II already does (1) and (2).
>

1 - of course, i'm trying to build on top of it.
2 - maybe i'm not understanding the existing documentation correctly - but
i couldn't find something that takes the specific relations (tables) under
consideration, only query type (Read/Write) or passing the delay_threshold.
Our approach here basically accepts no delay for these specific relations.
so you get guaranteed data freshness at the expense of checking the
specific table. it's a different kind of tradeoff.
the whole approach can be expanded to take further "generic values" under
considerations if needed to also take "tenant" id for instance under
consideration. tho for those cases, using a table per tenant already solves
that.

Please let me know what i might be missing here.


>
> > *Notes on behavior & ops:*
> >
> >
> >    -
> >
> >    *Config & hot reload:* Operators (or an external controller) can
> update
> >    recent_access_ttl_ms dynamically and trigger hot reload to adapt to
> >    changing conditions―no reliance on Aurora internals.
> >    -
> >
> >    *Safety levers:* a global max TTL, optional allow/deny lists, and
> >    metrics (e.g., “reads forced to primary due to recency”) for
> visibility.
>
> Please elaborate more on this. Allow/deny what?
>

We can add "table list" that would ignore the feature, or in reverse as an
allow list that would enable it only for specific tables. I don't think
that's needed, especially not for V1.


> >    *Defaults & compatibility:* all defaults are safe/off; enabling
> requires
> >    explicit opt-in.
>
> Sounds good.
>
> > I’ll prepare the code changes and send a patch/PR, but before diving in I
> > wanted to check if anyone has *objections, concerns, or preferred
> > alternatives*―particularly around parser hooks, shared memory use, or
> > hot-reload mechanics in pgpool-II.
>
> Probably you should consider adding a pcp command to notice pgpool the
> "recent_access_ttl_ms". That is far more efficient than reloading
> pgpool.conf.


Great idea! i wasn't aware of the mechanism to be honest.

lastly another note that came up - we can disable the feature and load
balancing in case that we have to evict old items in the map. or have it
configurable how to behave in such a scenario.


>


> > Thanks for considering,
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>

Best regards,
-- 
Nadav Shatz
Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-20 12:45  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-08-20 12:45 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

Thank you for the answer.

I think your proposal actually includes two orthogonal proposals.

(1) "inject" replication delay value from external source (in your
case from Aurora).

(2) per relation recent access based routing.

I suggest to implement (1) first, then (2). This incremental approach
would be easier than implementing (1)+(2) at once.

For (1) we could add new pgpool.conf parameter, say
"replication_delay_source". If it is set to "builtin", then
replication delay source is PostgreSQL as we already does today. If
it's set other than "builtin", then it's an external command name (+
arguments) to be executed to import replication delay value. The
command should return replication delay value represented in strings
like "0 20 10", which means node 0, 1 and 2 replication delay values
in millisecond (in this case since the node 0 is primary, its
replication delay is 0). The command will be invoked every
sr_check_period.

I am not sure if this actually works in Aurora. This is just a quick
idea.

(2) would be probably much harder than (1). So we need more discussion
later on.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp





^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-20 14:27  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-08-20 14:27 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

Thank you for your reply, I agree with your approach. Better to get (1) out
of the way first.

As a simplest approach that we can implement that would support completely
offloading the responsibility of the lag checking we can set it to “file”
and add another config for file path. Or just if starts with “file:” it’ll
understand.
Then the internal polling can just read the file on schedule. The entire
updating mechanism will be left to the external service.

Having this as a first step also opens up the door for other
implementations.

Another classic option would be calling an API endpoint. But that might
come with a lot more bulk and security concerns.

I suggest I work on a patch for file support.

What do you think?

Nadav Shatz
Tailor Brands | CTO


On Wed, Aug 20, 2025 at 3:45 PM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> Thank you for the answer.
>
> I think your proposal actually includes two orthogonal proposals.
>
> (1) "inject" replication delay value from external source (in your
> case from Aurora).
>
> (2) per relation recent access based routing.
>
> I suggest to implement (1) first, then (2). This incremental approach
> would be easier than implementing (1)+(2) at once.
>
> For (1) we could add new pgpool.conf parameter, say
> "replication_delay_source". If it is set to "builtin", then
> replication delay source is PostgreSQL as we already does today. If
> it's set other than "builtin", then it's an external command name (+
> arguments) to be executed to import replication delay value. The
> command should return replication delay value represented in strings
> like "0 20 10", which means node 0, 1 and 2 replication delay values
> in millisecond (in this case since the node 0 is primary, its
> replication delay is 0). The command will be invoked every
> sr_check_period.
>
> I am not sure if this actually works in Aurora. This is just a quick
> idea.
>
> (2) would be probably much harder than (1). So we need more discussion
> later on.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-21 05:04  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-08-21 05:04 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

> Hi Tatsuo,
> 
> Thank you for your reply, I agree with your approach. Better to get (1) out
> of the way first.
> 
> As a simplest approach that we can implement that would support completely
> offloading the responsibility of the lag checking we can set it to “file”
> and add another config for file path. Or just if starts with “file:” it’ll
> understand.

My concern about the "file:" approach is, race condition. What if
pgpool reads the file while it is being updated by someone else?  Also
I think the command approach is more flexible and generic. For
example, the "file approch" can be easily simulated by setting the
command "/usr/bin/cat path_to_the_file".

> Then the internal polling can just read the file on schedule. The entire
> updating mechanism will be left to the external service.

Internal polling is a little bit complicated and will not be easily
changed to just reading a file. The internal polling has two options:
one is checking WAL LSN difference, the other is replication delay in
time. The file approch would only replace the latter. I suggest to
leave the internal polling code as it is.

> Having this as a first step also opens up the door for other
> implementations.
> 
> Another classic option would be calling an API endpoint. But that might
> come with a lot more bulk and security concerns.

I agree that calling API could bring security concerns.

BTW, in the command approch, the command should be executed as
sr_check_user.

> I suggest I work on a patch for file support.
> 
> What do you think?

For the reason above I prefer the command approch, not the file
support.

> Nadav Shatz
> Tailor Brands | CTO
> 
> 
> On Wed, Aug 20, 2025 at 3:45 PM Tatsuo Ishii <[email protected]> wrote:
> 
>> Hi Nadav,
>>
>> Thank you for the answer.
>>
>> I think your proposal actually includes two orthogonal proposals.
>>
>> (1) "inject" replication delay value from external source (in your
>> case from Aurora).
>>
>> (2) per relation recent access based routing.
>>
>> I suggest to implement (1) first, then (2). This incremental approach
>> would be easier than implementing (1)+(2) at once.
>>
>> For (1) we could add new pgpool.conf parameter, say
>> "replication_delay_source". If it is set to "builtin", then
>> replication delay source is PostgreSQL as we already does today. If
>> it's set other than "builtin", then it's an external command name (+
>> arguments) to be executed to import replication delay value. The
>> command should return replication delay value represented in strings
>> like "0 20 10", which means node 0, 1 and 2 replication delay values
>> in millisecond (in this case since the node 0 is primary, its
>> replication delay is 0). The command will be invoked every
>> sr_check_period.
>>
>> I am not sure if this actually works in Aurora. This is just a quick
>> idea.
>>
>> (2) would be probably much harder than (1). So we need more discussion
>> later on.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-21 07:38  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-08-21 07:38 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

I'm fine with all of your comments and suggestions.

I'll work on a patch and we can iterate over it.

Hope that's okay.

Best,

On Thu, Aug 21, 2025 at 8:04 AM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> > Hi Tatsuo,
> >
> > Thank you for your reply, I agree with your approach. Better to get (1)
> out
> > of the way first.
> >
> > As a simplest approach that we can implement that would support
> completely
> > offloading the responsibility of the lag checking we can set it to “file”
> > and add another config for file path. Or just if starts with “file:”
> it’ll
> > understand.
>
> My concern about the "file:" approach is, race condition. What if
> pgpool reads the file while it is being updated by someone else?  Also
> I think the command approach is more flexible and generic. For
> example, the "file approch" can be easily simulated by setting the
> command "/usr/bin/cat path_to_the_file".
>
> > Then the internal polling can just read the file on schedule. The entire
> > updating mechanism will be left to the external service.
>
> Internal polling is a little bit complicated and will not be easily
> changed to just reading a file. The internal polling has two options:
> one is checking WAL LSN difference, the other is replication delay in
> time. The file approch would only replace the latter. I suggest to
> leave the internal polling code as it is.
>
> > Having this as a first step also opens up the door for other
> > implementations.
> >
> > Another classic option would be calling an API endpoint. But that might
> > come with a lot more bulk and security concerns.
>
> I agree that calling API could bring security concerns.
>
> BTW, in the command approch, the command should be executed as
> sr_check_user.
>
> > I suggest I work on a patch for file support.
> >
> > What do you think?
>
> For the reason above I prefer the command approch, not the file
> support.
>
> > Nadav Shatz
> > Tailor Brands | CTO
> >
> >
> > On Wed, Aug 20, 2025 at 3:45 PM Tatsuo Ishii <[email protected]>
> wrote:
> >
> >> Hi Nadav,
> >>
> >> Thank you for the answer.
> >>
> >> I think your proposal actually includes two orthogonal proposals.
> >>
> >> (1) "inject" replication delay value from external source (in your
> >> case from Aurora).
> >>
> >> (2) per relation recent access based routing.
> >>
> >> I suggest to implement (1) first, then (2). This incremental approach
> >> would be easier than implementing (1)+(2) at once.
> >>
> >> For (1) we could add new pgpool.conf parameter, say
> >> "replication_delay_source". If it is set to "builtin", then
> >> replication delay source is PostgreSQL as we already does today. If
> >> it's set other than "builtin", then it's an external command name (+
> >> arguments) to be executed to import replication delay value. The
> >> command should return replication delay value represented in strings
> >> like "0 20 10", which means node 0, 1 and 2 replication delay values
> >> in millisecond (in this case since the node 0 is primary, its
> >> replication delay is 0). The command will be invoked every
> >> sr_check_period.
> >>
> >> I am not sure if this actually works in Aurora. This is just a quick
> >> idea.
> >>
> >> (2) would be probably much harder than (1). So we need more discussion
> >> later on.
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
>


-- 
Nadav Shatz
Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-21 10:23  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-08-21 10:23 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

Thank you for understanding. Please don't hesitate to ask questions
regarding Pgpool-II source code.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Hi Tatsuo,
> 
> I'm fine with all of your comments and suggestions.
> 
> I'll work on a patch and we can iterate over it.
> 
> Hope that's okay.
> 
> Best,
> 
> On Thu, Aug 21, 2025 at 8:04 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> Hi Nadav,
>>
>> > Hi Tatsuo,
>> >
>> > Thank you for your reply, I agree with your approach. Better to get (1)
>> out
>> > of the way first.
>> >
>> > As a simplest approach that we can implement that would support
>> completely
>> > offloading the responsibility of the lag checking we can set it to “file”
>> > and add another config for file path. Or just if starts with “file:”
>> it’ll
>> > understand.
>>
>> My concern about the "file:" approach is, race condition. What if
>> pgpool reads the file while it is being updated by someone else?  Also
>> I think the command approach is more flexible and generic. For
>> example, the "file approch" can be easily simulated by setting the
>> command "/usr/bin/cat path_to_the_file".
>>
>> > Then the internal polling can just read the file on schedule. The entire
>> > updating mechanism will be left to the external service.
>>
>> Internal polling is a little bit complicated and will not be easily
>> changed to just reading a file. The internal polling has two options:
>> one is checking WAL LSN difference, the other is replication delay in
>> time. The file approch would only replace the latter. I suggest to
>> leave the internal polling code as it is.
>>
>> > Having this as a first step also opens up the door for other
>> > implementations.
>> >
>> > Another classic option would be calling an API endpoint. But that might
>> > come with a lot more bulk and security concerns.
>>
>> I agree that calling API could bring security concerns.
>>
>> BTW, in the command approch, the command should be executed as
>> sr_check_user.
>>
>> > I suggest I work on a patch for file support.
>> >
>> > What do you think?
>>
>> For the reason above I prefer the command approch, not the file
>> support.
>>
>> > Nadav Shatz
>> > Tailor Brands | CTO
>> >
>> >
>> > On Wed, Aug 20, 2025 at 3:45 PM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> Hi Nadav,
>> >>
>> >> Thank you for the answer.
>> >>
>> >> I think your proposal actually includes two orthogonal proposals.
>> >>
>> >> (1) "inject" replication delay value from external source (in your
>> >> case from Aurora).
>> >>
>> >> (2) per relation recent access based routing.
>> >>
>> >> I suggest to implement (1) first, then (2). This incremental approach
>> >> would be easier than implementing (1)+(2) at once.
>> >>
>> >> For (1) we could add new pgpool.conf parameter, say
>> >> "replication_delay_source". If it is set to "builtin", then
>> >> replication delay source is PostgreSQL as we already does today. If
>> >> it's set other than "builtin", then it's an external command name (+
>> >> arguments) to be executed to import replication delay value. The
>> >> command should return replication delay value represented in strings
>> >> like "0 20 10", which means node 0, 1 and 2 replication delay values
>> >> in millisecond (in this case since the node 0 is primary, its
>> >> replication delay is 0). The command will be invoked every
>> >> sr_check_period.
>> >>
>> >> I am not sure if this actually works in Aurora. This is just a quick
>> >> idea.
>> >>
>> >> (2) would be probably much harder than (1). So we need more discussion
>> >> later on.
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-24 11:11  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-08-24 11:11 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

Here is an initial draft in 2 patches (one for code changes and one for
tests implementation).

Please let me know what you think.

Thank you,

On Thu, Aug 21, 2025 at 1:23 PM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> Thank you for understanding. Please don't hesitate to ask questions
> regarding Pgpool-II source code.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Hi Tatsuo,
> >
> > I'm fine with all of your comments and suggestions.
> >
> > I'll work on a patch and we can iterate over it.
> >
> > Hope that's okay.
> >
> > Best,
> >
> > On Thu, Aug 21, 2025 at 8:04 AM Tatsuo Ishii <[email protected]>
> wrote:
> >
> >> Hi Nadav,
> >>
> >> > Hi Tatsuo,
> >> >
> >> > Thank you for your reply, I agree with your approach. Better to get
> (1)
> >> out
> >> > of the way first.
> >> >
> >> > As a simplest approach that we can implement that would support
> >> completely
> >> > offloading the responsibility of the lag checking we can set it to
> “file”
> >> > and add another config for file path. Or just if starts with “file:”
> >> it’ll
> >> > understand.
> >>
> >> My concern about the "file:" approach is, race condition. What if
> >> pgpool reads the file while it is being updated by someone else?  Also
> >> I think the command approach is more flexible and generic. For
> >> example, the "file approch" can be easily simulated by setting the
> >> command "/usr/bin/cat path_to_the_file".
> >>
> >> > Then the internal polling can just read the file on schedule. The
> entire
> >> > updating mechanism will be left to the external service.
> >>
> >> Internal polling is a little bit complicated and will not be easily
> >> changed to just reading a file. The internal polling has two options:
> >> one is checking WAL LSN difference, the other is replication delay in
> >> time. The file approch would only replace the latter. I suggest to
> >> leave the internal polling code as it is.
> >>
> >> > Having this as a first step also opens up the door for other
> >> > implementations.
> >> >
> >> > Another classic option would be calling an API endpoint. But that
> might
> >> > come with a lot more bulk and security concerns.
> >>
> >> I agree that calling API could bring security concerns.
> >>
> >> BTW, in the command approch, the command should be executed as
> >> sr_check_user.
> >>
> >> > I suggest I work on a patch for file support.
> >> >
> >> > What do you think?
> >>
> >> For the reason above I prefer the command approch, not the file
> >> support.
> >>
> >> > Nadav Shatz
> >> > Tailor Brands | CTO
> >> >
> >> >
> >> > On Wed, Aug 20, 2025 at 3:45 PM Tatsuo Ishii <[email protected]>
> >> wrote:
> >> >
> >> >> Hi Nadav,
> >> >>
> >> >> Thank you for the answer.
> >> >>
> >> >> I think your proposal actually includes two orthogonal proposals.
> >> >>
> >> >> (1) "inject" replication delay value from external source (in your
> >> >> case from Aurora).
> >> >>
> >> >> (2) per relation recent access based routing.
> >> >>
> >> >> I suggest to implement (1) first, then (2). This incremental approach
> >> >> would be easier than implementing (1)+(2) at once.
> >> >>
> >> >> For (1) we could add new pgpool.conf parameter, say
> >> >> "replication_delay_source". If it is set to "builtin", then
> >> >> replication delay source is PostgreSQL as we already does today. If
> >> >> it's set other than "builtin", then it's an external command name (+
> >> >> arguments) to be executed to import replication delay value. The
> >> >> command should return replication delay value represented in strings
> >> >> like "0 20 10", which means node 0, 1 and 2 replication delay values
> >> >> in millisecond (in this case since the node 0 is primary, its
> >> >> replication delay is 0). The command will be invoked every
> >> >> sr_check_period.
> >> >>
> >> >> I am not sure if this actually works in Aurora. This is just a quick
> >> >> idea.
> >> >>
> >> >> (2) would be probably much harder than (1). So we need more
> discussion
> >> >> later on.
> >> >>
> >> >> Best regards,
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS K.K.
> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> Japanese:http://www.sraoss.co.jp
> >> >>
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>


-- 
Nadav Shatz
Tailor Brands | CTO


Attachments:

  [application/octet-stream] external-lag-feature-implementation.patch (16.0K, 3-external-lag-feature-implementation.patch)
  download | inline diff:
From 6a1ff112ceb5fa1b6b344769ededfafa55eb8d90 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Sun, 24 Aug 2025 13:49:36 +0300
Subject: [PATCH] Add external command replication delay source feature

This patch introduces a comprehensive external command replication delay
source feature that allows pgpool to retrieve replication delay information
from external commands instead of built-in database queries.

Key Features:
- External command execution with configurable timeout (1-3600 seconds)
- Secure command construction with injection protection
- Support for running commands as specific users (sr_check_user)
- Comprehensive input validation and error handling
- Graceful fallback to built-in method on failures

Configuration Options:
- replication_delay_source: 'builtin' (default) or 'cmd'
- replication_delay_source_cmd: external command to execute
- replication_delay_source_timeout: command timeout in seconds (default: 10)

Security Features:
- Command injection protection via proper single-quote escaping
- Safe su command construction preventing malicious execution
- Input validation to prevent injection through delay values
- Comprehensive range validation for delay values

Robustness Features:
- SIGALRM-based timeout mechanism with proper signal handling
- Dynamic buffer allocation (4KB) with truncation detection
- PG_TRY/PG_CATCH blocks for proper error handling and cleanup
- Memory leak prevention in all error paths
- Token count validation ensuring output matches NUM_BACKENDS
- Primary node delay correction (always 0ms)
- Support for both integer and floating-point delay values

Command Format:
External commands should output space-separated delay values in milliseconds:
"node0_delay node1_delay node2_delay ..."
Example: "0 25.5 100" (primary: 0ms, standby1: 25.5ms, standby2: 100ms)

This enables integration with custom monitoring solutions, external
replication lag measurement tools, and enterprise monitoring systems
while maintaining full backward compatibility and security.
---
 src/config/pool_config_variables.c            |  36 ++
 src/include/pool_config.h                     |   9 +
 src/sample/pgpool.conf.sample-stream          |  16 +
 src/streaming_replication/pool_worker_child.c | 333 +++++++++++++++++-
 4 files changed, 393 insertions(+), 1 deletion(-)

diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 5bbe46d3a..233bada89 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -310,6 +310,12 @@ static const struct config_enum_entry check_temp_table_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry replication_delay_source_options[] = {
+	{"builtin", REPLICATION_DELAY_BUILTIN, false},
+	{"cmd", REPLICATION_DELAY_CMD, false},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry log_backend_messages_options[] = {
 	{"none", BGMSG_NONE, false},	/* turn off logging */
 	{"terse", BGMSG_TERSE, false},	/* terse logging (repeated messages are
@@ -980,6 +986,36 @@ static struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Source of replication delay information.",
+			CONFIG_VAR_TYPE_ENUM, false, 0
+		},
+		&g_pool_config.replication_delay_source,
+		"builtin",
+		NULL, NULL, NULL, replication_delay_source_options
+	},
+
+	{
+		{"replication_delay_source_cmd", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"External command to retrieve replication delay information.",
+			CONFIG_VAR_TYPE_STRING, false, 0
+		},
+		&g_pool_config.replication_delay_source_cmd,
+		"",
+		NULL, NULL, NULL, NULL
+	},
+
+	{
+		{"replication_delay_source_timeout", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Timeout for external replication delay command execution in seconds.",
+			CONFIG_VAR_TYPE_INT, false, 0
+		},
+		&g_pool_config.replication_delay_source_timeout,
+		10,
+		1, 3600, NULL, NULL
+	},
+
 	{
 		{"failback_command", CFGCXT_RELOAD, FAILOVER_CONFIG,
 			"Command to execute when backend node is attached.",
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index be82750e5..1a8262dd7 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -94,6 +94,12 @@ typedef enum LogStandbyDelayModes
 	LSD_NONE
 } LogStandbyDelayModes;
 
+typedef enum ReplicationDelaySourceModes
+{
+	REPLICATION_DELAY_BUILTIN = 1,
+	REPLICATION_DELAY_CMD
+} ReplicationDelaySourceModes;
+
 
 typedef enum MemCacheMethod
 {
@@ -371,6 +377,9 @@ typedef struct
 	char	   *sr_check_password;	/* password for sr_check_user */
 	char	   *sr_check_database;	/* PostgreSQL database name for streaming
 									 * replication check */
+	int			replication_delay_source;	/* replication delay source: builtin or cmd */
+	char	   *replication_delay_source_cmd;	/* external command for replication delay */
+	int			replication_delay_source_timeout;	/* timeout for external command in seconds */
 	char	   *failover_command;	/* execute command when failover happens */
 	char	   *follow_primary_command; /* execute command when failover is
 										 * ended */
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index a7eb594c9..76d51e0fa 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -519,6 +519,22 @@ backend_clustering_mode = streaming_replication
 
 #sr_check_database = 'postgres'
                                    # Database name for streaming replication check
+
+#replication_delay_source = 'builtin'
+                                   # Source of replication delay information
+                                   # 'builtin': use built-in database queries (default)
+                                   # 'cmd': use external command
+#replication_delay_source_cmd = ''
+                                   # External command to retrieve replication delay information
+                                   # Only used when replication_delay_source = 'cmd'
+                                   # Command should output delay values in milliseconds
+                                   # Format: "0 20 10" (node0 node1 node2 delays)
+                                   # Command runs with sr_check_user credentials
+#replication_delay_source_timeout = 10
+                                   # Timeout for external command execution in seconds
+                                   # Only used when replication_delay_source = 'cmd'
+                                   # Range: 1-3600 seconds (default: 10)
+
 #delay_threshold = 0
                                    # Threshold before not dispatching query to standby node
                                    # Unit is in bytes
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 4f8f823a3..a80dc27a4 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -76,6 +76,7 @@ static volatile sig_atomic_t restart_request = 0;
 static void establish_persistent_connection(void);
 static void discard_persistent_connection(void);
 static void check_replication_time_lag(void);
+static void check_replication_time_lag_with_cmd(void);
 static void CheckReplicationTimeLagErrorCb(void *arg);
 static unsigned long long int text_to_lsn(char *text);
 static RETSIGTYPE my_signal_handler(int sig);
@@ -259,7 +260,10 @@ do_worker_child(void)
 					POOL_NODE_STATUS *node_status;
 					int			i;
 
-					/* Do replication time lag checking */
+									/* Do replication time lag checking */
+				if (pool_config->replication_delay_source == REPLICATION_DELAY_CMD)
+					check_replication_time_lag_with_cmd();
+				else
 					check_replication_time_lag();
 
 					/* Check node status */
@@ -659,6 +663,333 @@ check_replication_time_lag(void)
 	error_context_stack = callback.previous;
 }
 
+#define MAX_CMD_OUTPUT 4096
+#define MAX_REASONABLE_DELAY_MS 3600000.0  /* 1 hour in milliseconds */
+
+/* Global variable to track command timeout */
+static volatile sig_atomic_t command_timeout_occurred = 0;
+
+/*
+ * Signal handler for command timeout
+ */
+static void
+command_timeout_handler(int sig)
+{
+	command_timeout_occurred = 1;
+}
+
+/*
+ * Escape single quotes in a string for shell command safety
+ */
+static char *
+escape_single_quotes(const char *input)
+{
+	const char *src;
+	char *result, *dst;
+	int quote_count = 0;
+	int len;
+
+	/* Count single quotes to determine result size */
+	for (src = input; *src; src++)
+	{
+		if (*src == '\'')
+			quote_count++;
+	}
+
+	/* Allocate result: original length + 3 chars per quote (replace ' with '\''') */
+	len = strlen(input) + (quote_count * 3) + 1;
+	result = palloc(len);
+
+	/* Copy and escape */
+	dst = result;
+	for (src = input; *src; src++)
+	{
+		if (*src == '\'')
+		{
+			/* Replace ' with '\'' */
+			*dst++ = '\'';
+			*dst++ = '\\';
+			*dst++ = '\'';
+			*dst++ = '\'';
+		}
+		else
+		{
+			*dst++ = *src;
+		}
+	}
+	*dst = '\0';
+
+	return result;
+}
+
+/*
+ * Check replication time lag using external command
+ */
+static void
+check_replication_time_lag_with_cmd(void)
+{
+	FILE		   *fp;
+	char		   *command = NULL;
+	char		   *escaped_cmd = NULL;
+	char		   *line;
+	char		   *token;
+	char		   *saveptr;
+	int				node_id;
+	double			delay_ms;
+	uint64			delay;
+	int				token_count = 0;
+	BackendInfo	   *bkinfo;
+	ErrorContextCallback callback;
+	bool			cmd_allocated = false;
+
+	if (NUM_BACKENDS <= 1)
+	{
+		/* If there's only one node, there's no point to do checking */
+		return;
+	}
+
+	if (REAL_PRIMARY_NODE_ID < 0)
+	{
+		/* No need to check if there's no primary */
+		return;
+	}
+
+	if (!VALID_BACKEND(REAL_PRIMARY_NODE_ID))
+	{
+		/* No need to check replication delay if primary is down */
+		return;
+	}
+
+	if (!pool_config->replication_delay_source_cmd ||
+		strlen(pool_config->replication_delay_source_cmd) == 0)
+	{
+		ereport(WARNING,
+				(errmsg("replication_delay_source is set to 'cmd' but replication_delay_source_cmd is not configured"),
+				 errhint("Set replication_delay_source_cmd or change replication_delay_source to 'builtin'")));
+		/* Fall back to builtin method */
+		check_replication_time_lag();
+		return;
+	}
+
+	/* Allocate buffer for command output */
+	line = palloc(MAX_CMD_OUTPUT);
+
+	/*
+	 * Register a error context callback to throw proper context message
+	 */
+	callback.callback = CheckReplicationTimeLagErrorCb;
+	callback.arg = NULL;
+	callback.previous = error_context_stack;
+	error_context_stack = &callback;
+
+	/* Build command to run as sr_check_user if specified */
+	PG_TRY();
+	{
+		if (pool_config->sr_check_user && strlen(pool_config->sr_check_user) > 0)
+		{
+			char *full_command;
+			int cmd_len;
+			
+			/* Escape the command to prevent injection */
+			escaped_cmd = escape_single_quotes(pool_config->replication_delay_source_cmd);
+			
+			cmd_len = strlen(escaped_cmd) + 
+					  strlen(pool_config->sr_check_user) + 20; /* extra space for "su - user -c ''" */
+			
+			full_command = palloc(cmd_len);
+			snprintf(full_command, cmd_len, "su - %s -c '%s'", 
+					 pool_config->sr_check_user, escaped_cmd);
+			command = full_command;
+			cmd_allocated = true;
+		}
+		else
+		{
+			command = pool_config->replication_delay_source_cmd;
+		}
+
+		ereport(DEBUG1,
+				(errmsg("executing replication delay command: %s", command)));
+
+		/* Set up timeout for command execution */
+		command_timeout_occurred = 0;
+		signal(SIGALRM, command_timeout_handler);
+		alarm(pool_config->replication_delay_source_timeout);
+
+		fp = popen(command, "r");
+		if (fp == NULL)
+		{
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+			ereport(ERROR,
+					(errmsg("failed to execute replication delay command: %s", command),
+					 errdetail("popen failed: %m")));
+		}
+
+		if (fgets(line, MAX_CMD_OUTPUT, fp) == NULL)
+		{
+			int pclose_result = pclose(fp);
+			fp = NULL;
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+			
+			if (command_timeout_occurred)
+			{
+				ereport(ERROR,
+						(errmsg("replication delay command timed out after %d seconds: %s",
+								pool_config->replication_delay_source_timeout, command),
+						 errhint("Consider increasing replication_delay_source_timeout or optimizing the command")));
+			}
+			else
+			{
+				ereport(ERROR,
+						(errmsg("failed to read output from replication delay command: %s", command),
+						 errdetail("command exit status: %d", pclose_result)));
+			}
+		}
+
+		alarm(0); /* Cancel alarm */
+		signal(SIGALRM, SIG_DFL);
+
+		/* Check if output was truncated */
+		if (strlen(line) == MAX_CMD_OUTPUT - 1 && line[MAX_CMD_OUTPUT - 2] != '\n')
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command output may have been truncated")));
+		}
+
+		pclose(fp);
+		fp = NULL;
+
+		/* Parse the output format "0 20 10" where each number is delay in milliseconds for nodes 0, 1, 2 etc */
+		/* Count tokens first for validation */
+		char *line_copy = pstrdup(line);
+		char *temp_token = strtok(line_copy, " \t\n");
+		while (temp_token != NULL)
+		{
+			token_count++;
+			temp_token = strtok(NULL, " \t\n");
+		}
+		pfree(line_copy);
+
+		/* Now parse the actual tokens */
+		token = strtok_r(line, " \t\n", &saveptr);
+		node_id = 0;
+
+		if (token_count != NUM_BACKENDS)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d",
+							token_count, NUM_BACKENDS),
+					 errhint("Command should output one delay value per backend node")));
+		}
+
+		while (token != NULL && node_id < NUM_BACKENDS)
+		{
+			if (!VALID_BACKEND(node_id))
+			{
+				node_id++;
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				continue;
+			}
+
+			char *endptr;
+			delay_ms = strtod(token, &endptr);
+			
+			/* Validate the conversion */
+			if (*endptr != '\0')
+			{
+				ereport(WARNING,
+						(errmsg("invalid delay value '%s' for node %d, treating as 0",
+								token, node_id)));
+				delay_ms = 0;
+			}
+			
+			/* Validate delay value range */
+			if (delay_ms < 0)
+			{
+				ereport(WARNING,
+						(errmsg("negative delay value %.3f for node %d, treating as 0",
+								delay_ms, node_id)));
+				delay_ms = 0;
+			}
+			else if (delay_ms > MAX_REASONABLE_DELAY_MS)
+			{
+				ereport(WARNING,
+						(errmsg("extremely large delay value %.3f for node %d",
+								delay_ms, node_id)));
+			}
+
+			bkinfo = pool_get_node_info(node_id);
+
+			if (PRIMARY_NODE_ID == node_id)
+			{
+				/* Primary node should always have 0 delay */
+				bkinfo->standby_delay = 0;
+				if (delay_ms > 0)
+				{
+					ereport(DEBUG1,
+							(errmsg("primary node %d reported non-zero delay %.3f, setting to 0",
+									node_id, delay_ms)));
+				}
+			}
+			else
+			{
+				/* Convert delay from milliseconds to microseconds for internal storage */
+				delay = (uint64)(delay_ms * 1000);
+				bkinfo->standby_delay = delay;
+				bkinfo->standby_delay_by_time = true;
+
+				/* Log delay if necessary */
+				uint64 delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in milliseconds, convert to microseconds */
+
+				if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
+					(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
+					 bkinfo->standby_delay > delay_threshold_by_time))
+				{
+					ereport(LOG,
+							(errmsg("Replication of node: %d is behind %.3f second(s) from the primary server (node: %d) [external command]",
+									node_id, delay_ms / 1000, PRIMARY_NODE_ID)));
+				}
+			}
+
+			node_id++;
+			token = strtok_r(NULL, " \t\n", &saveptr);
+		}
+
+	}
+	}
+	PG_CATCH();
+	{
+		/* Cleanup in case of error */
+		alarm(0); /* Cancel any pending alarm */
+		signal(SIGALRM, SIG_DFL);
+		if (fp)
+		{
+			pclose(fp);
+			fp = NULL;
+		}
+		if (line)
+			pfree(line);
+		if (escaped_cmd)
+			pfree(escaped_cmd);
+		if (cmd_allocated && command)
+			pfree(command);
+		error_context_stack = callback.previous;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Normal cleanup */
+	if (line)
+		pfree(line);
+	if (escaped_cmd)
+		pfree(escaped_cmd);
+	if (cmd_allocated && command)
+		pfree(command);
+	
+	error_context_stack = callback.previous;
+}
+
 static void
 CheckReplicationTimeLagErrorCb(void *arg)
 {
-- 
2.51.0



  [application/octet-stream] external-lag-feature-tests.patch (25.7K, 4-external-lag-feature-tests.patch)
  download | inline diff:
From 4d843e69bacb6c1806fe2409a7c304f4fed7f385 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Sun, 24 Aug 2025 13:49:54 +0300
Subject: [PATCH] Add comprehensive test suite for external replication delay
 feature

This patch adds a complete test suite for the external command replication
delay source feature with comprehensive coverage of all functionality and
edge cases.

Test Coverage:
- Basic external command execution with integer millisecond values
- Floating-point millisecond value parsing and handling
- Delay threshold functionality with external commands
- User switching with sr_check_user parameter
- Error handling for missing/invalid commands and execution failures
- Command timeout handling with configurable timeout values
- Input validation for invalid, negative, and extremely large delay values
- Handling of wrong number of output values from commands
- Primary node delay correction (always 0ms)
- Output truncation detection and warnings
- Timeout behavior with both short and long timeout values

Test Files:
- test.sh: Main regression test with 7 comprehensive test scenarios
- test_validation.sh: Validation and edge case testing with 6 test scenarios
- test_parsing.sh: Unit test for parsing logic and output format validation
- README: Complete documentation of test coverage and expected behavior

Test Improvements:
- Intelligent wait loops replacing fixed sleeps for better reliability
- Proper error detection and reporting mechanisms
- Comprehensive log analysis and validation
- Better progress reporting during test execution
- Deterministic timing to reduce test flakiness

The test suite ensures the external command feature integrates properly with
existing pgpool functionality and handles various edge cases gracefully.
Tests follow existing pgpool regression test patterns and will be
automatically discovered by the test runner.

Expected Behavior:
- External commands should be executed as configured
- Delay values should be parsed correctly (both int and float)
- Threshold comparisons should work properly with external delays
- Error conditions should be handled gracefully with proper fallbacks
- Commands should timeout appropriately based on configuration
- Timeout errors should provide helpful messages and hints
---
 .../083.external_replication_delay/README     |  43 +++
 .../083.external_replication_delay/test.sh    | 304 ++++++++++++++++++
 .../test_parsing.sh                           |  54 ++++
 .../test_validation.sh                        | 285 ++++++++++++++++
 4 files changed, 686 insertions(+)
 create mode 100644 src/test/regression/tests/083.external_replication_delay/README
 create mode 100755 src/test/regression/tests/083.external_replication_delay/test.sh
 create mode 100755 src/test/regression/tests/083.external_replication_delay/test_parsing.sh
 create mode 100755 src/test/regression/tests/083.external_replication_delay/test_validation.sh

diff --git a/src/test/regression/tests/083.external_replication_delay/README b/src/test/regression/tests/083.external_replication_delay/README
new file mode 100644
index 000000000..1808ba854
--- /dev/null
+++ b/src/test/regression/tests/083.external_replication_delay/README
@@ -0,0 +1,43 @@
+External Replication Delay Command Test
+========================================
+
+This test verifies the external command replication delay source feature.
+
+Test Coverage:
+- Basic external command execution with integer millisecond values
+- Floating-point millisecond value parsing
+- Delay threshold functionality with external commands
+- User switching with sr_check_user parameter
+- Error handling for missing/invalid commands
+- Command execution failure scenarios
+- Command timeout handling with configurable timeout values
+- Input validation for invalid, negative, and extremely large delay values
+- Handling of wrong number of output values
+- Primary node delay correction
+- Output truncation detection
+- Timeout behavior with both short and long timeout values
+
+Files:
+- test.sh: Main test script
+- test_parsing.sh: Unit test for parsing logic  
+- test_validation.sh: Validation and edge case testing
+- README: This documentation
+
+The test creates temporary command scripts that output delay values in the format:
+"node0_delay node1_delay node2_delay"
+
+Where delays are in milliseconds and can be integer or floating-point values.
+
+Test Environment:
+- Uses streaming replication mode with 3 nodes
+- Configures sr_check_period = 1 second for faster testing
+- Tests various delay scenarios and threshold behaviors
+
+Expected Behavior:
+- External commands should be executed as configured
+- Delay values should be parsed correctly (both int and float)
+- Threshold comparisons should work properly
+- Error conditions should be handled gracefully
+- Commands should timeout appropriately based on configuration
+- Timeout errors should provide helpful messages and hints
+- Tests should be reliable with proper wait mechanisms instead of fixed sleeps
\ No newline at end of file
diff --git a/src/test/regression/tests/083.external_replication_delay/test.sh b/src/test/regression/tests/083.external_replication_delay/test.sh
new file mode 100755
index 000000000..044ef341e
--- /dev/null
+++ b/src/test/regression/tests/083.external_replication_delay/test.sh
@@ -0,0 +1,304 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command replication delay source
+#
+source $TESTLIBS
+TESTDIR=testdir
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create external command scripts for testing
+cat > delay_cmd_static.sh << 'EOF'
+#!/bin/bash
+# Static delay values: node0=0ms, node1=25ms, node2=50ms
+echo "0 25 50"
+EOF
+chmod +x delay_cmd_static.sh
+
+cat > delay_cmd_float.sh << 'EOF'
+#!/bin/bash
+# Float delay values: node0=0ms, node1=25.5ms, node2=100.75ms
+echo "0 25.5 100.75"
+EOF
+chmod +x delay_cmd_float.sh
+
+cat > delay_cmd_high.sh << 'EOF'
+#!/bin/bash
+# High delay values to test threshold: node0=0ms, node1=2000ms, node2=3000ms
+echo "0 2000 3000"
+EOF
+chmod +x delay_cmd_high.sh
+
+# ----------------------------------------------------------------------------------------
+echo === Test1: Basic external command with integer millisecond values ===
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source = 'cmd'" >> etc/pgpool.conf
+echo "replication_delay_source_cmd = './delay_cmd_static.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run and populate delay values
+# sr_check_period is 1 second, so wait a bit longer to ensure it runs
+echo "Waiting for sr_check to run..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that delay values are populated in the log
+grep "executing replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command was not executed
+    echo "Log contents:"
+    tail -20 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+# Verify actual delay values were parsed
+if ! $PSQL -t -c "SHOW POOL_NODES" test | grep -E "[0-9]+\.[0-9]+" >/dev/null; then
+    echo "Warning: No delay values found in POOL_NODES output"
+fi
+
+# Check for delay log messages
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command delay logging not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: basic external command test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test2: External command with floating-point millisecond values ===
+# ----------------------------------------------------------------------------------------
+# Update configuration to use float command
+sed -i.bak "s|delay_cmd_static.sh|delay_cmd_float.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with float values
+echo "Waiting for sr_check with float values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log 2>/dev/null; then
+        echo "Float command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that float values are handled correctly
+grep "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: float command was not executed
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: floating-point values test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test3: External command with delay threshold ===
+# ----------------------------------------------------------------------------------------
+# Update configuration to use high delay command and set threshold
+sed -i.bak "s|delay_cmd_float.sh|delay_cmd_high.sh|" etc/pgpool.conf
+echo "delay_threshold_by_time = 1000" >> etc/pgpool.conf
+echo "backend_weight0 = 0" >> etc/pgpool.conf  # Force queries to standby normally
+echo "backend_weight2 = 0" >> etc/pgpool.conf  # Only use node 1 as standby
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and detect high delays
+echo "Waiting for sr_check with high delay values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_high.sh" log/pgpool.log 2>/dev/null; then
+        echo "High delay command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SELECT * FROM t1 LIMIT 1;
+EOF
+
+# With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
+grep "SELECT \* FROM t1 LIMIT 1.*DB node id: 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: query was not sent to primary node despite high delay
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: delay threshold test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test4: External command with sr_check_user ===
+# ----------------------------------------------------------------------------------------
+# Test running command as specific user (using current user for test)
+CURRENT_USER=$(whoami)
+echo "sr_check_user = '$CURRENT_USER'" >> etc/pgpool.conf
+sed -i.bak "s|delay_cmd_high.sh|delay_cmd_static.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with user switching
+echo "Waiting for sr_check with user switching..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*su.*$CURRENT_USER" log/pgpool.log 2>/dev/null; then
+        echo "User switching command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check that command was executed with su
+grep "executing replication delay command.*su.*$CURRENT_USER.*delay_cmd_static.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command was not executed with sr_check_user
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: sr_check_user test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test5: Error handling - missing command ===
+# ----------------------------------------------------------------------------------------
+# Test error handling when command is not configured
+sed -i.bak "s|replication_delay_source_cmd = './delay_cmd_static.sh'|replication_delay_source_cmd = ''|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with missing command
+echo "Waiting for sr_check with missing command..."
+for i in {1..5}; do
+    if grep -q "replication_delay_source_cmd is not configured" log/pgpool.log 2>/dev/null; then
+        echo "Missing command error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about missing command
+grep "replication_delay_source_cmd is not configured" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: missing command error not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: error handling test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test6: Error handling - command execution failure ===
+# ----------------------------------------------------------------------------------------
+# Test error handling when command fails
+echo "replication_delay_source_cmd = './nonexistent_command.sh'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with failing command
+echo "Waiting for sr_check with failing command..."
+for i in {1..5}; do
+    if grep -q "failed to execute replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command failure detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about command execution failure
+grep "failed to execute replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command execution failure not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command failure test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test7: Command timeout handling ===
+# ----------------------------------------------------------------------------------------
+# Create a command that takes longer than the timeout
+cat > delay_cmd_slow.sh << 'EOF'
+#!/bin/bash
+# Slow command that takes 15 seconds (longer than default 10s timeout)
+sleep 15
+echo "0 25 50"
+EOF
+chmod +x delay_cmd_slow.sh
+
+# Set a short timeout and use the slow command
+sed -i.bak "s|replication_delay_source_cmd = './nonexistent_command.sh'|replication_delay_source_cmd = './delay_cmd_slow.sh'|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 3" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and timeout
+echo "Waiting for command timeout..."
+for i in {1..15}; do
+    if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+        echo "Command timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout error message
+grep "replication delay command timed out after 3 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command timeout test succeeded
+./shutdownall
+
+echo "All external replication delay tests passed!"
+exit 0
\ No newline at end of file
diff --git a/src/test/regression/tests/083.external_replication_delay/test_parsing.sh b/src/test/regression/tests/083.external_replication_delay/test_parsing.sh
new file mode 100755
index 000000000..143da337e
--- /dev/null
+++ b/src/test/regression/tests/083.external_replication_delay/test_parsing.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+#-------------------------------------------------------------------
+# Unit test for external command parsing logic
+# This tests the parsing without needing a full pgpool setup
+#
+
+echo "=== Testing external command output parsing ==="
+
+# Test 1: Integer values
+echo "Test 1: Integer millisecond values"
+echo "0 25 50" > test_output.txt
+echo "Expected: 0ms, 25ms, 50ms"
+echo "Output: $(cat test_output.txt)"
+echo ""
+
+# Test 2: Float values  
+echo "Test 2: Floating-point millisecond values"
+echo "0 25.5 100.75" > test_output_float.txt
+echo "Expected: 0ms, 25.5ms, 100.75ms"
+echo "Output: $(cat test_output_float.txt)"
+echo ""
+
+# Test 3: High precision float values
+echo "Test 3: High precision values"
+echo "0 0.001 999.999" > test_output_precision.txt
+echo "Expected: 0ms, 0.001ms, 999.999ms"
+echo "Output: $(cat test_output_precision.txt)"
+echo ""
+
+# Test 4: Edge case - zero values
+echo "Test 4: All zero values"
+echo "0 0 0" > test_output_zeros.txt
+echo "Expected: 0ms, 0ms, 0ms"
+echo "Output: $(cat test_output_zeros.txt)"
+echo ""
+
+# Test 5: Edge case - large values
+echo "Test 5: Large delay values"
+echo "0 5000 10000" > test_output_large.txt
+echo "Expected: 0ms, 5000ms, 10000ms"
+echo "Output: $(cat test_output_large.txt)"
+echo ""
+
+# Test 6: Mixed integer and float
+echo "Test 6: Mixed integer and float values"
+echo "0 25 50.5" > test_output_mixed.txt
+echo "Expected: 0ms, 25ms, 50.5ms"
+echo "Output: $(cat test_output_mixed.txt)"
+echo ""
+
+# Cleanup
+rm -f test_output*.txt
+
+echo "All parsing tests completed. These outputs should be parseable by the external command feature."
\ No newline at end of file
diff --git a/src/test/regression/tests/083.external_replication_delay/test_validation.sh b/src/test/regression/tests/083.external_replication_delay/test_validation.sh
new file mode 100755
index 000000000..4be884c4e
--- /dev/null
+++ b/src/test/regression/tests/083.external_replication_delay/test_validation.sh
@@ -0,0 +1,285 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command validation and edge cases
+#
+source $TESTLIBS
+TESTDIR=testdir_validation
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create test command scripts
+cat > delay_cmd_validation.sh << 'EOF'
+#!/bin/bash
+# Test validation: output with invalid values
+echo "0 invalid_value 50.5"
+EOF
+chmod +x delay_cmd_validation.sh
+
+cat > delay_cmd_negative.sh << 'EOF'
+#!/bin/bash
+# Test negative values
+echo "0 -25 50"
+EOF
+chmod +x delay_cmd_negative.sh
+
+cat > delay_cmd_large.sh << 'EOF'
+#!/bin/bash
+# Test extremely large values
+echo "0 9999999 50"
+EOF
+chmod +x delay_cmd_large.sh
+
+cat > delay_cmd_wrong_count.sh << 'EOF'
+#!/bin/bash
+# Test wrong number of values (only 2 instead of 3)
+echo "0 25"
+EOF
+chmod +x delay_cmd_wrong_count.sh
+
+cat > delay_cmd_truncated.sh << 'EOF'
+#!/bin/bash
+# Test output that might be truncated (very long line)
+printf "0 25 "
+for i in {1..1000}; do printf "very_long_output_"; done
+echo "50"
+EOF
+chmod +x delay_cmd_truncated.sh
+
+# ----------------------------------------------------------------------------------------
+echo === Test1: Validation of invalid delay values ===
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source = 'cmd'" >> etc/pgpool.conf
+echo "replication_delay_source_cmd = './delay_cmd_validation.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run
+echo "Waiting for validation test..."
+for i in {1..10}; do
+    if grep -q "invalid delay value" log/pgpool.log 2>/dev/null; then
+        echo "Validation error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for validation warning
+grep "invalid delay value 'invalid_value' for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: validation warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: invalid value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test2: Negative delay values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_validation.sh|delay_cmd_negative.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for negative value test..."
+for i in {1..10}; do
+    if grep -q "negative delay value" log/pgpool.log 2>/dev/null; then
+        echo "Negative value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for negative value warning
+grep "negative delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: negative value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: negative value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test3: Extremely large delay values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_negative.sh|delay_cmd_large.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for large value test..."
+for i in {1..10}; do
+    if grep -q "extremely large delay value" log/pgpool.log 2>/dev/null; then
+        echo "Large value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for large value warning
+grep "extremely large delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: large value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: large value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test4: Wrong number of output values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_large.sh|delay_cmd_wrong_count.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for wrong count test..."
+for i in {1..10}; do
+    if grep -q "returned.*values, expected" log/pgpool.log 2>/dev/null; then
+        echo "Wrong count warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for wrong count warning
+grep "returned.*values, expected.*Command should output one delay value per backend node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: wrong count warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: wrong count validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test5: Primary node non-zero delay handling ===
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_primary_nonzero.sh << 'EOF'
+#!/bin/bash
+# Test primary node with non-zero delay (should be corrected to 0)
+echo "100 25 50"
+EOF
+chmod +x delay_cmd_primary_nonzero.sh
+
+sed -i.bak "s|delay_cmd_wrong_count.sh|delay_cmd_primary_nonzero.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for primary non-zero delay test..."
+for i in {1..10}; do
+    if grep -q "primary node.*reported non-zero delay" log/pgpool.log 2>/dev/null; then
+        echo "Primary non-zero delay detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for primary node correction
+grep "primary node.*reported non-zero delay.*setting to 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: primary node delay correction not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: primary node delay correction test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test6: Command timeout with different timeout values ===
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_timeout.sh << 'EOF'
+#!/bin/bash
+# Command that takes 5 seconds
+sleep 5
+echo "0 25 50"
+EOF
+chmod +x delay_cmd_timeout.sh
+
+# Test with timeout shorter than command duration
+sed -i.bak "s|delay_cmd_primary_nonzero.sh|delay_cmd_timeout.sh|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 2" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for timeout
+echo "Waiting for timeout test (2s timeout, 5s command)..."
+for i in {1..10}; do
+    if grep -q "replication delay command timed out after 2 seconds" log/pgpool.log 2>/dev/null; then
+        echo "Timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout message
+grep "replication delay command timed out after 2 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: timeout test succeeded
+./shutdownall
+
+# Test with timeout longer than command duration
+sed -i.bak "s|replication_delay_source_timeout = 2|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for successful execution
+echo "Waiting for successful execution (10s timeout, 5s command)..."
+for i in {1..15}; do
+    if grep -q "executing replication delay command.*delay_cmd_timeout.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed successfully after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should not timeout this time
+if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not have timed out with 10s timeout
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: extended timeout test succeeded
+./shutdownall
+
+echo "All validation tests passed!"
+exit 0
\ No newline at end of file
-- 
2.51.0



^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-25 02:18  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-08-25 02:18 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

Thank you for the patch!

I have one question. How do you provide a password (sr_check_password)
while executing replication_delay_source_cmd as sr_check_user? In my
understanding replication_delay_source_cmd is executed through su
command in your patch. In this case su command tries to read the
password from terminal. I don't see such a code in the patch.

BTW, I start to think that executing replication_delay_source_cmd as
sr_check_user might not be a good idea. sr_check_user is a database
user, not OS user. In PostgreSQL they are not necessarily the
same. Also doing su in pgpool process needs to be very carefully to
avoid vulnerability. Probably we just execute it as pgpool OS user?

Lastly when I apply the patches using git apply, there are some
trailing space errors.

$ git apply ~/external-lag-feature-implementation.patch 
/home/t-ishii/external-lag-feature-implementation.patch:314: trailing whitespace.
			
/home/t-ishii/external-lag-feature-implementation.patch:317: trailing whitespace.
			
/home/t-ishii/external-lag-feature-implementation.patch:318: trailing whitespace.
			cmd_len = strlen(escaped_cmd) + 
/home/t-ishii/external-lag-feature-implementation.patch:320: trailing whitespace.
			
/home/t-ishii/external-lag-feature-implementation.patch:322: trailing whitespace.
			snprintf(full_command, cmd_len, "su - %s -c '%s'", 
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.

$ git apply ~/external-lag-feature-tests.patch 
/home/t-ishii/external-lag-feature-tests.patch:87: trailing whitespace.
- test_parsing.sh: Unit test for parsing logic  
/home/t-ishii/external-lag-feature-tests.patch:440: trailing whitespace.
# Test 2: Float values  
warning: 2 lines add whitespace errors.

Also I have some compilation errors after patching the source
code. See attached compilation log.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Making all in src
make[1]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src'
Making all in parser
make[2]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/parser'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/parser'
Making all in libs
make[2]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs'
Making all in pcp
make[3]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs/pcp'
make[3]: Nothing to be done for 'all'.
make[3]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs/pcp'
make[3]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs'
make[3]: Nothing to be done for 'all-am'.
make[3]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs'
make[2]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs'
Making all in watchdog
make[2]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/watchdog'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/watchdog'
Making all in .
make[2]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src'
gcc -DHAVE_CONFIG_H -DDEFAULT_CONFIGDIR=\"/usr/local/etc\" -DPGSQL_BIN_DIR=\"/usr/local/pgsql/bin\" -I. -I../src/include  -D_GNU_SOURCE -I /usr/local/pgsql/include   -g -O2 -Wall -Wmissing-prototypes -Wmissing-declarations -Wno-format-truncation -Wno-stringop-truncation -fno-strict-aliasing -c -o streaming_replication/pool_worker_child.o streaming_replication/pool_worker_child.c
streaming_replication/pool_worker_child.c: In function 'do_worker_child':
streaming_replication/pool_worker_child.c:266:5: warning: this 'else' clause does not guard... [-Wmisleading-indentation]
  266 |     else
      |     ^~~~
streaming_replication/pool_worker_child.c:270:6: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'else'
  270 |      node_status = verify_backend_node_status(slots);
      |      ^~~~~~~~~~~
In file included from streaming_replication/pool_worker_child.c:56:
streaming_replication/pool_worker_child.c: In function 'check_replication_time_lag_with_cmd':
../src/include/utils/elog.h:397:25: warning: unused variable 'save_context_stack' [-Wunused-variable]
  397 |   ErrorContextCallback *save_context_stack = error_context_stack; \
      |                         ^~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:786:2: note: in expansion of macro 'PG_TRY'
  786 |  PG_TRY();
      |  ^~~~~~
../src/include/utils/elog.h:396:15: warning: unused variable 'save_exception_stack' [-Wunused-variable]
  396 |   sigjmp_buf *save_exception_stack = PG_exception_stack; \
      |               ^~~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:786:2: note: in expansion of macro 'PG_TRY'
  786 |  PG_TRY();
      |  ^~~~~~
../src/include/utils/elog.h:405:3: error: expected 'while' before 'else'
  405 |   else \
      |   ^~~~
streaming_replication/pool_worker_child.c:961:2: note: in expansion of macro 'PG_CATCH'
  961 |  PG_CATCH();
      |  ^~~~~~~~
../src/include/utils/elog.h:412:24: error: 'save_exception_stack' undeclared (first use in this function); did you mean 'PG_exception_stack'?
  412 |   PG_exception_stack = save_exception_stack; \
      |                        ^~~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:980:2: note: in expansion of macro 'PG_END_TRY'
  980 |  PG_END_TRY();
      |  ^~~~~~~~~~
../src/include/utils/elog.h:412:24: note: each undeclared identifier is reported only once for each function it appears in
  412 |   PG_exception_stack = save_exception_stack; \
      |                        ^~~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:980:2: note: in expansion of macro 'PG_END_TRY'
  980 |  PG_END_TRY();
      |  ^~~~~~~~~~
../src/include/utils/elog.h:413:25: error: 'save_context_stack' undeclared (first use in this function); did you mean 'error_context_stack'?
  413 |   error_context_stack = save_context_stack; \
      |                         ^~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:980:2: note: in expansion of macro 'PG_END_TRY'
  980 |  PG_END_TRY();
      |  ^~~~~~~~~~
streaming_replication/pool_worker_child.c:743:9: warning: variable 'cmd_allocated' set but not used [-Wunused-but-set-variable]
  743 |  bool   cmd_allocated = false;
      |         ^~~~~~~~~~~~~
In file included from streaming_replication/pool_worker_child.c:56:
streaming_replication/pool_worker_child.c: At top level:
../src/include/utils/elog.h:414:4: error: expected identifier or '(' before 'while'
  414 |  } while (0)
      |    ^~~~~
streaming_replication/pool_worker_child.c:980:2: note: in expansion of macro 'PG_END_TRY'
  980 |  PG_END_TRY();
      |  ^~~~~~~~~~
streaming_replication/pool_worker_child.c:983:2: error: expected identifier or '(' before 'if'
  983 |  if (line)
      |  ^~
streaming_replication/pool_worker_child.c:985:2: error: expected identifier or '(' before 'if'
  985 |  if (escaped_cmd)
      |  ^~
streaming_replication/pool_worker_child.c:987:2: error: expected identifier or '(' before 'if'
  987 |  if (cmd_allocated && command)
      |  ^~
streaming_replication/pool_worker_child.c:990:2: warning: data definition has no type or storage class
  990 |  error_context_stack = callback.previous;
      |  ^~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:990:2: warning: type defaults to 'int' in declaration of 'error_context_stack' [-Wimplicit-int]
streaming_replication/pool_worker_child.c:990:2: error: conflicting types for 'error_context_stack'
In file included from streaming_replication/pool_worker_child.c:56:
../src/include/utils/elog.h:360:42: note: previous declaration of 'error_context_stack' was here
  360 | extern PGDLLIMPORT ErrorContextCallback *error_context_stack;
      |                                          ^~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:990:24: error: 'callback' undeclared here (not in a function); did you mean 'calloc'?
  990 |  error_context_stack = callback.previous;
      |                        ^~~~~~~~
      |                        calloc
streaming_replication/pool_worker_child.c:991:1: error: expected identifier or '(' before '}' token
  991 | }
      | ^
In file included from streaming_replication/pool_worker_child.c:56:
streaming_replication/pool_worker_child.c: In function 'get_query_result':
../src/include/utils/elog.h:397:46: warning: initialization of 'ErrorContextCallback *' {aka 'struct ErrorContextCallback *'} from 'int' makes pointer from integer without a cast [-Wint-conversion]
  397 |   ErrorContextCallback *save_context_stack = error_context_stack; \
      |                                              ^~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:1099:2: note: in expansion of macro 'PG_TRY'
 1099 |  PG_TRY();
      |  ^~~~~~
../src/include/utils/elog.h:408:24: warning: assignment to 'int' from 'ErrorContextCallback *' {aka 'struct ErrorContextCallback *'} makes integer from pointer without a cast [-Wint-conversion]
  408 |    error_context_stack = save_context_stack
      |                        ^
streaming_replication/pool_worker_child.c:1103:2: note: in expansion of macro 'PG_CATCH'
 1103 |  PG_CATCH();
      |  ^~~~~~~~
../src/include/utils/elog.h:413:23: warning: assignment to 'int' from 'ErrorContextCallback *' {aka 'struct ErrorContextCallback *'} makes integer from pointer without a cast [-Wint-conversion]
  413 |   error_context_stack = save_context_stack; \
      |                       ^
streaming_replication/pool_worker_child.c:1112:2: note: in expansion of macro 'PG_END_TRY'
 1112 |  PG_END_TRY();
      |  ^~~~~~~~~~
make[2]: *** [Makefile:883: streaming_replication/pool_worker_child.o] Error 1
make[2]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src'
make[1]: *** [Makefile:949: all-recursive] Error 1
make[1]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src'
make: *** [Makefile:413: all-recursive] Error 1


Attachments:

  [text/plain] compile.log (8.0K, 2-compile.log)
  download | inline:
Making all in src
make[1]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src'
Making all in parser
make[2]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/parser'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/parser'
Making all in libs
make[2]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs'
Making all in pcp
make[3]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs/pcp'
make[3]: Nothing to be done for 'all'.
make[3]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs/pcp'
make[3]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs'
make[3]: Nothing to be done for 'all-am'.
make[3]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs'
make[2]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/libs'
Making all in watchdog
make[2]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/watchdog'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src/watchdog'
Making all in .
make[2]: Entering directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src'
gcc -DHAVE_CONFIG_H -DDEFAULT_CONFIGDIR=\"/usr/local/etc\" -DPGSQL_BIN_DIR=\"/usr/local/pgsql/bin\" -I. -I../src/include  -D_GNU_SOURCE -I /usr/local/pgsql/include   -g -O2 -Wall -Wmissing-prototypes -Wmissing-declarations -Wno-format-truncation -Wno-stringop-truncation -fno-strict-aliasing -c -o streaming_replication/pool_worker_child.o streaming_replication/pool_worker_child.c
streaming_replication/pool_worker_child.c: In function 'do_worker_child':
streaming_replication/pool_worker_child.c:266:5: warning: this 'else' clause does not guard... [-Wmisleading-indentation]
  266 |     else
      |     ^~~~
streaming_replication/pool_worker_child.c:270:6: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'else'
  270 |      node_status = verify_backend_node_status(slots);
      |      ^~~~~~~~~~~
In file included from streaming_replication/pool_worker_child.c:56:
streaming_replication/pool_worker_child.c: In function 'check_replication_time_lag_with_cmd':
../src/include/utils/elog.h:397:25: warning: unused variable 'save_context_stack' [-Wunused-variable]
  397 |   ErrorContextCallback *save_context_stack = error_context_stack; \
      |                         ^~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:786:2: note: in expansion of macro 'PG_TRY'
  786 |  PG_TRY();
      |  ^~~~~~
../src/include/utils/elog.h:396:15: warning: unused variable 'save_exception_stack' [-Wunused-variable]
  396 |   sigjmp_buf *save_exception_stack = PG_exception_stack; \
      |               ^~~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:786:2: note: in expansion of macro 'PG_TRY'
  786 |  PG_TRY();
      |  ^~~~~~
../src/include/utils/elog.h:405:3: error: expected 'while' before 'else'
  405 |   else \
      |   ^~~~
streaming_replication/pool_worker_child.c:961:2: note: in expansion of macro 'PG_CATCH'
  961 |  PG_CATCH();
      |  ^~~~~~~~
../src/include/utils/elog.h:412:24: error: 'save_exception_stack' undeclared (first use in this function); did you mean 'PG_exception_stack'?
  412 |   PG_exception_stack = save_exception_stack; \
      |                        ^~~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:980:2: note: in expansion of macro 'PG_END_TRY'
  980 |  PG_END_TRY();
      |  ^~~~~~~~~~
../src/include/utils/elog.h:412:24: note: each undeclared identifier is reported only once for each function it appears in
  412 |   PG_exception_stack = save_exception_stack; \
      |                        ^~~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:980:2: note: in expansion of macro 'PG_END_TRY'
  980 |  PG_END_TRY();
      |  ^~~~~~~~~~
../src/include/utils/elog.h:413:25: error: 'save_context_stack' undeclared (first use in this function); did you mean 'error_context_stack'?
  413 |   error_context_stack = save_context_stack; \
      |                         ^~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:980:2: note: in expansion of macro 'PG_END_TRY'
  980 |  PG_END_TRY();
      |  ^~~~~~~~~~
streaming_replication/pool_worker_child.c:743:9: warning: variable 'cmd_allocated' set but not used [-Wunused-but-set-variable]
  743 |  bool   cmd_allocated = false;
      |         ^~~~~~~~~~~~~
In file included from streaming_replication/pool_worker_child.c:56:
streaming_replication/pool_worker_child.c: At top level:
../src/include/utils/elog.h:414:4: error: expected identifier or '(' before 'while'
  414 |  } while (0)
      |    ^~~~~
streaming_replication/pool_worker_child.c:980:2: note: in expansion of macro 'PG_END_TRY'
  980 |  PG_END_TRY();
      |  ^~~~~~~~~~
streaming_replication/pool_worker_child.c:983:2: error: expected identifier or '(' before 'if'
  983 |  if (line)
      |  ^~
streaming_replication/pool_worker_child.c:985:2: error: expected identifier or '(' before 'if'
  985 |  if (escaped_cmd)
      |  ^~
streaming_replication/pool_worker_child.c:987:2: error: expected identifier or '(' before 'if'
  987 |  if (cmd_allocated && command)
      |  ^~
streaming_replication/pool_worker_child.c:990:2: warning: data definition has no type or storage class
  990 |  error_context_stack = callback.previous;
      |  ^~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:990:2: warning: type defaults to 'int' in declaration of 'error_context_stack' [-Wimplicit-int]
streaming_replication/pool_worker_child.c:990:2: error: conflicting types for 'error_context_stack'
In file included from streaming_replication/pool_worker_child.c:56:
../src/include/utils/elog.h:360:42: note: previous declaration of 'error_context_stack' was here
  360 | extern PGDLLIMPORT ErrorContextCallback *error_context_stack;
      |                                          ^~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:990:24: error: 'callback' undeclared here (not in a function); did you mean 'calloc'?
  990 |  error_context_stack = callback.previous;
      |                        ^~~~~~~~
      |                        calloc
streaming_replication/pool_worker_child.c:991:1: error: expected identifier or '(' before '}' token
  991 | }
      | ^
In file included from streaming_replication/pool_worker_child.c:56:
streaming_replication/pool_worker_child.c: In function 'get_query_result':
../src/include/utils/elog.h:397:46: warning: initialization of 'ErrorContextCallback *' {aka 'struct ErrorContextCallback *'} from 'int' makes pointer from integer without a cast [-Wint-conversion]
  397 |   ErrorContextCallback *save_context_stack = error_context_stack; \
      |                                              ^~~~~~~~~~~~~~~~~~~
streaming_replication/pool_worker_child.c:1099:2: note: in expansion of macro 'PG_TRY'
 1099 |  PG_TRY();
      |  ^~~~~~
../src/include/utils/elog.h:408:24: warning: assignment to 'int' from 'ErrorContextCallback *' {aka 'struct ErrorContextCallback *'} makes integer from pointer without a cast [-Wint-conversion]
  408 |    error_context_stack = save_context_stack
      |                        ^
streaming_replication/pool_worker_child.c:1103:2: note: in expansion of macro 'PG_CATCH'
 1103 |  PG_CATCH();
      |  ^~~~~~~~
../src/include/utils/elog.h:413:23: warning: assignment to 'int' from 'ErrorContextCallback *' {aka 'struct ErrorContextCallback *'} makes integer from pointer without a cast [-Wint-conversion]
  413 |   error_context_stack = save_context_stack; \
      |                       ^
streaming_replication/pool_worker_child.c:1112:2: note: in expansion of macro 'PG_END_TRY'
 1112 |  PG_END_TRY();
      |  ^~~~~~~~~~
make[2]: *** [Makefile:883: streaming_replication/pool_worker_child.o] Error 1
make[2]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src'
make[1]: *** [Makefile:949: all-recursive] Error 1
make[1]: Leaving directory '/home/t-ishii/work/Pgpool-II/current/pgpool2/src'
make: *** [Makefile:413: all-recursive] Error 1

^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-25 12:50  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-08-25 12:50 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

Thank you for the notes - please find attached an updated version.

What do you think?

Thanks,

On Mon, Aug 25, 2025 at 5:18 AM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> Thank you for the patch!
>
> I have one question. How do you provide a password (sr_check_password)
> while executing replication_delay_source_cmd as sr_check_user? In my
> understanding replication_delay_source_cmd is executed through su
> command in your patch. In this case su command tries to read the
> password from terminal. I don't see such a code in the patch.
>
> BTW, I start to think that executing replication_delay_source_cmd as
> sr_check_user might not be a good idea. sr_check_user is a database
> user, not OS user. In PostgreSQL they are not necessarily the
> same. Also doing su in pgpool process needs to be very carefully to
> avoid vulnerability. Probably we just execute it as pgpool OS user?
>
> Lastly when I apply the patches using git apply, there are some
> trailing space errors.
>
> $ git apply ~/external-lag-feature-implementation.patch
> /home/t-ishii/external-lag-feature-implementation.patch:314: trailing
> whitespace.
>
> /home/t-ishii/external-lag-feature-implementation.patch:317: trailing
> whitespace.
>
> /home/t-ishii/external-lag-feature-implementation.patch:318: trailing
> whitespace.
>                         cmd_len = strlen(escaped_cmd) +
> /home/t-ishii/external-lag-feature-implementation.patch:320: trailing
> whitespace.
>
> /home/t-ishii/external-lag-feature-implementation.patch:322: trailing
> whitespace.
>                         snprintf(full_command, cmd_len, "su - %s -c '%s'",
> warning: squelched 4 whitespace errors
> warning: 9 lines add whitespace errors.
>
> $ git apply ~/external-lag-feature-tests.patch
> /home/t-ishii/external-lag-feature-tests.patch:87: trailing whitespace.
> - test_parsing.sh: Unit test for parsing logic
> /home/t-ishii/external-lag-feature-tests.patch:440: trailing whitespace.
> # Test 2: Float values
> warning: 2 lines add whitespace errors.
>
> Also I have some compilation errors after patching the source
> code. See attached compilation log.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>


-- 
Nadav Shatz
Tailor Brands | CTO


Attachments:

  [application/octet-stream] changes.patch (35.3K, 3-changes.patch)
  download | inline diff:
diff --git c/src/config/pool_config_variables.c w/src/config/pool_config_variables.c
index 5bbe46d3a..d2be434e0 100644
--- c/src/config/pool_config_variables.c
+++ w/src/config/pool_config_variables.c
@@ -310,6 +310,12 @@ static const struct config_enum_entry check_temp_table_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry replication_delay_source_options[] = {
+	{"builtin", REPLICATION_DELAY_BUILTIN, false},
+	{"cmd", REPLICATION_DELAY_CMD, false},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry log_backend_messages_options[] = {
 	{"none", BGMSG_NONE, false},	/* turn off logging */
 	{"terse", BGMSG_TERSE, false},	/* terse logging (repeated messages are
@@ -980,6 +986,16 @@ static struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_cmd", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"External command to retrieve replication delay information.",
+			CONFIG_VAR_TYPE_STRING, false, 0
+		},
+		&g_pool_config.replication_delay_source_cmd,
+		"",
+		NULL, NULL, NULL, NULL
+	},
+
 	{
 		{"failback_command", CFGCXT_RELOAD, FAILOVER_CONFIG,
 			"Command to execute when backend node is attached.",
@@ -2323,6 +2339,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_timeout", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Timeout for external replication delay command execution in seconds.",
+			CONFIG_VAR_TYPE_INT, false, 0
+		},
+		&g_pool_config.replication_delay_source_timeout,
+		10,
+		1, 3600,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	EMPTY_CONFIG_INT
 };
@@ -2485,6 +2512,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Source of replication delay information.",
+			CONFIG_VAR_TYPE_ENUM, false, 0
+		},
+		(int *) &g_pool_config.replication_delay_source,
+		REPLICATION_DELAY_BUILTIN,
+		replication_delay_source_options,
+		NULL, NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	EMPTY_CONFIG_ENUM
 };
diff --git c/src/include/pool_config.h w/src/include/pool_config.h
index be82750e5..1a8262dd7 100644
--- c/src/include/pool_config.h
+++ w/src/include/pool_config.h
@@ -94,6 +94,12 @@ typedef enum LogStandbyDelayModes
 	LSD_NONE
 } LogStandbyDelayModes;
 
+typedef enum ReplicationDelaySourceModes
+{
+	REPLICATION_DELAY_BUILTIN = 1,
+	REPLICATION_DELAY_CMD
+} ReplicationDelaySourceModes;
+
 
 typedef enum MemCacheMethod
 {
@@ -371,6 +377,9 @@ typedef struct
 	char	   *sr_check_password;	/* password for sr_check_user */
 	char	   *sr_check_database;	/* PostgreSQL database name for streaming
 									 * replication check */
+	int			replication_delay_source;	/* replication delay source: builtin or cmd */
+	char	   *replication_delay_source_cmd;	/* external command for replication delay */
+	int			replication_delay_source_timeout;	/* timeout for external command in seconds */
 	char	   *failover_command;	/* execute command when failover happens */
 	char	   *follow_primary_command; /* execute command when failover is
 										 * ended */
diff --git c/src/sample/pgpool.conf.sample-stream w/src/sample/pgpool.conf.sample-stream
index a7eb594c9..662e767a6 100644
--- c/src/sample/pgpool.conf.sample-stream
+++ w/src/sample/pgpool.conf.sample-stream
@@ -519,6 +519,22 @@ backend_clustering_mode = streaming_replication
 
 #sr_check_database = 'postgres'
                                    # Database name for streaming replication check
+
+#replication_delay_source = 'builtin'
+                                   # Source of replication delay information
+                                   # 'builtin': use built-in database queries (default)
+                                   # 'cmd': use external command
+#replication_delay_source_cmd = ''
+                                   # External command to retrieve replication delay information
+                                   # Only used when replication_delay_source = 'cmd'
+                                   # Command should output delay values in milliseconds
+                                   # Format: "0 20 10" (node0 node1 node2 delays)
+                                   # Command runs as the pgpool process user
+#replication_delay_source_timeout = 10
+                                   # Timeout for external command execution in seconds
+                                   # Only used when replication_delay_source = 'cmd'
+                                   # Range: 1-3600 seconds (default: 10)
+
 #delay_threshold = 0
                                    # Threshold before not dispatching query to standby node
                                    # Unit is in bytes
diff --git c/src/streaming_replication/pool_worker_child.c w/src/streaming_replication/pool_worker_child.c
index 4f8f823a3..260989f27 100644
--- c/src/streaming_replication/pool_worker_child.c
+++ w/src/streaming_replication/pool_worker_child.c
@@ -76,6 +76,7 @@ static volatile sig_atomic_t restart_request = 0;
 static void establish_persistent_connection(void);
 static void discard_persistent_connection(void);
 static void check_replication_time_lag(void);
+static void check_replication_time_lag_with_cmd(void);
 static void CheckReplicationTimeLagErrorCb(void *arg);
 static unsigned long long int text_to_lsn(char *text);
 static RETSIGTYPE my_signal_handler(int sig);
@@ -259,7 +260,10 @@ do_worker_child(void)
 					POOL_NODE_STATUS *node_status;
 					int			i;
 
-					/* Do replication time lag checking */
+				/* Do replication time lag checking */
+				if (pool_config->replication_delay_source == REPLICATION_DELAY_CMD)
+					check_replication_time_lag_with_cmd();
+				else
 					check_replication_time_lag();
 
 					/* Check node status */
@@ -659,6 +663,260 @@ check_replication_time_lag(void)
 	error_context_stack = callback.previous;
 }
 
+#define MAX_CMD_OUTPUT 4096
+#define MAX_REASONABLE_DELAY_MS 3600000.0  /* 1 hour in milliseconds */
+
+/* Global variable to track command timeout */
+static volatile sig_atomic_t command_timeout_occurred = 0;
+
+/*
+ * Signal handler for command timeou
+ */
+static void
+command_timeout_handler(int sig)
+{
+	command_timeout_occurred = 1;
+}
+
+
+
+/*
+ * Check replication time lag using external command
+ */
+static void
+check_replication_time_lag_with_cmd(void)
+{
+	FILE		   *fp;
+	char		   *command;
+	char		   *line;
+	char		   *token;
+	char		   *saveptr;
+	int				node_id;
+	double			delay_ms;
+	uint64			delay;
+	int				token_count = 0;
+	BackendInfo	   *bkinfo;
+	ErrorContextCallback callback;
+
+	if (NUM_BACKENDS <= 1)
+	{
+		/* If there's only one node, there's no point to do checking */
+		return;
+	}
+
+	if (REAL_PRIMARY_NODE_ID < 0)
+	{
+		/* No need to check if there's no primary */
+		return;
+	}
+
+	if (!VALID_BACKEND(REAL_PRIMARY_NODE_ID))
+	{
+		/* No need to check replication delay if primary is down */
+		return;
+	}
+
+	if (!pool_config->replication_delay_source_cmd ||
+		strlen(pool_config->replication_delay_source_cmd) == 0)
+	{
+		ereport(WARNING,
+				(errmsg("replication_delay_source is set to 'cmd' but replication_delay_source_cmd is not configured"),
+				 errhint("Set replication_delay_source_cmd or change replication_delay_source to 'builtin'")));
+		/* Fall back to builtin method */
+		check_replication_time_lag();
+		return;
+	}
+
+	/* Allocate buffer for command output */
+	line = palloc(MAX_CMD_OUTPUT);
+
+	/*
+	 * Register a error context callback to throw proper context message
+	 */
+	callback.callback = CheckReplicationTimeLagErrorCb;
+	callback.arg = NULL;
+	callback.previous = error_context_stack;
+	error_context_stack = &callback;
+
+	/* Execute command as current process user */
+	PG_TRY();
+	{
+		command = pool_config->replication_delay_source_cmd;
+
+		ereport(DEBUG1,
+				(errmsg("executing replication delay command: %s", command)));
+
+		/* Set up timeout for command execution */
+		command_timeout_occurred = 0;
+		signal(SIGALRM, command_timeout_handler);
+		alarm(pool_config->replication_delay_source_timeout);
+
+		fp = popen(command, "r");
+		if (fp == NULL)
+		{
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+			ereport(ERROR,
+					(errmsg("failed to execute replication delay command: %s", command),
+					 errdetail("popen failed: %m")));
+		}
+
+		if (fgets(line, MAX_CMD_OUTPUT, fp) == NULL)
+		{
+			int pclose_result = pclose(fp);
+			fp = NULL;
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+
+			if (command_timeout_occurred)
+			{
+				ereport(ERROR,
+						(errmsg("replication delay command timed out after %d seconds: %s",
+								pool_config->replication_delay_source_timeout, command),
+						 errhint("Consider increasing replication_delay_source_timeout or optimizing the command")));
+			}
+			else
+			{
+				ereport(ERROR,
+						(errmsg("failed to read output from replication delay command: %s", command),
+						 errdetail("command exit status: %d", pclose_result)));
+			}
+		}
+
+		alarm(0); /* Cancel alarm */
+		signal(SIGALRM, SIG_DFL);
+
+		/* Check if output was truncated */
+		if (strlen(line) == MAX_CMD_OUTPUT - 1 && line[MAX_CMD_OUTPUT - 2] != '\n')
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command output may have been truncated")));
+		}
+
+		pclose(fp);
+		fp = NULL;
+
+		/* Parse the output format "0 20 10" where each number is delay in milliseconds for nodes 0, 1, 2 etc */
+		/* Count tokens first for validation */
+		char *line_copy = pstrdup(line);
+		char *temp_token = strtok(line_copy, " \t\n");
+		while (temp_token != NULL)
+		{
+			token_count++;
+			temp_token = strtok(NULL, " \t\n");
+		}
+		pfree(line_copy);
+
+		/* Now parse the actual tokens */
+		token = strtok_r(line, " \t\n", &saveptr);
+		node_id = 0;
+
+		if (token_count != NUM_BACKENDS)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d",
+							token_count, NUM_BACKENDS),
+					 errhint("Command should output one delay value per backend node")));
+		}
+
+		while (token != NULL && node_id < NUM_BACKENDS)
+		{
+			if (!VALID_BACKEND(node_id))
+			{
+				node_id++;
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				continue;
+			}
+
+			char *endptr;
+			delay_ms = strtod(token, &endptr);
+
+			/* Validate the conversion */
+			if (*endptr != '\0')
+			{
+				ereport(WARNING,
+						(errmsg("invalid delay value '%s' for node %d, treating as 0",
+								token, node_id)));
+				delay_ms = 0;
+			}
+
+			/* Validate delay value range */
+			if (delay_ms < 0)
+			{
+				ereport(WARNING,
+						(errmsg("negative delay value %.3f for node %d, treating as 0",
+								delay_ms, node_id)));
+				delay_ms = 0;
+			}
+			else if (delay_ms > MAX_REASONABLE_DELAY_MS)
+			{
+				ereport(WARNING,
+						(errmsg("extremely large delay value %.3f for node %d",
+								delay_ms, node_id)));
+			}
+
+			bkinfo = pool_get_node_info(node_id);
+
+			if (PRIMARY_NODE_ID == node_id)
+			{
+				/* Primary node should always have 0 delay */
+				bkinfo->standby_delay = 0;
+				if (delay_ms > 0)
+				{
+					ereport(DEBUG1,
+							(errmsg("primary node %d reported non-zero delay %.3f, setting to 0",
+									node_id, delay_ms)));
+				}
+			}
+			else
+			{
+				/* Convert delay from milliseconds to microseconds for internal storage */
+				delay = (uint64)(delay_ms * 1000);
+				bkinfo->standby_delay = delay;
+				bkinfo->standby_delay_by_time = true;
+
+				/* Log delay if necessary */
+				uint64 delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in milliseconds, convert to microseconds */
+
+				if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
+					(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
+					 bkinfo->standby_delay > delay_threshold_by_time))
+				{
+					ereport(LOG,
+							(errmsg("Replication of node: %d is behind %.3f second(s) from the primary server (node: %d) [external command]",
+									node_id, delay_ms / 1000, PRIMARY_NODE_ID)));
+				}
+			}
+
+			node_id++;
+			token = strtok_r(NULL, " \t\n", &saveptr);
+		}
+
+	}
+	PG_CATCH();
+	{
+		/* Cleanup in case of error */
+		alarm(0); /* Cancel any pending alarm */
+		signal(SIGALRM, SIG_DFL);
+		if (fp)
+		{
+			pclose(fp);
+			fp = NULL;
+		}
+		if (line)
+			pfree(line);
+		error_context_stack = callback.previous;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Normal cleanup */
+	if (line)
+		pfree(line);
+
+	error_context_stack = callback.previous;
+}
+
 static void
 CheckReplicationTimeLagErrorCb(void *arg)
 {
diff --git c/src/test/regression/tests/083.external_replication_delay/README w/src/test/regression/tests/083.external_replication_delay/README
new file mode 100644
index 000000000..a3257ea11
--- /dev/null
+++ w/src/test/regression/tests/083.external_replication_delay/README
@@ -0,0 +1,43 @@
+External Replication Delay Command Tes
+========================================
+
+This test verifies the external command replication delay source feature.
+
+Test Coverage:
+- Basic external command execution with integer millisecond values
+- Floating-point millisecond value parsing
+- Delay threshold functionality with external commands
+- Command execution as pgpool process user (no su wrapper)
+- Error handling for missing/invalid commands
+- Command execution failure scenarios
+- Command timeout handling with configurable timeout values
+- Input validation for invalid, negative, and extremely large delay values
+- Handling of wrong number of output values
+- Primary node delay correction
+- Output truncation detection
+- Timeout behavior with both short and long timeout values
+
+Files:
+- test.sh: Main test scrip
+- test_parsing.sh: Unit test for parsing logic
+- test_validation.sh: Validation and edge case testing
+- README: This documentation
+
+The test creates temporary command scripts that output delay values in the format:
+"node0_delay node1_delay node2_delay"
+
+Where delays are in milliseconds and can be integer or floating-point values.
+
+Test Environment:
+- Uses streaming replication mode with 3 nodes
+- Configures sr_check_period = 1 second for faster testing
+- Tests various delay scenarios and threshold behaviors
+
+Expected Behavior:
+- External commands should be executed as configured
+- Delay values should be parsed correctly (both int and float)
+- Threshold comparisons should work properly
+- Error conditions should be handled gracefully
+- Commands should timeout appropriately based on configuration
+- Timeout errors should provide helpful messages and hints
+- Tests should be reliable with proper wait mechanisms instead of fixed sleeps
\ No newline at end of file
diff --git c/src/test/regression/tests/083.external_replication_delay/test.sh w/src/test/regression/tests/083.external_replication_delay/test.sh
new file mode 100755
index 000000000..57abdfc03
--- /dev/null
+++ w/src/test/regression/tests/083.external_replication_delay/test.sh
@@ -0,0 +1,309 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command replication delay source
+#
+source $TESTLIBS
+TESTDIR=testdir
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environmen
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create external command scripts for testing
+cat > delay_cmd_static.sh << 'EOF'
+#!/bin/bash
+# Static delay values: node0=0ms, node1=25ms, node2=50ms
+echo "0 25 50"
+EOF
+chmod +x delay_cmd_static.sh
+
+cat > delay_cmd_float.sh << 'EOF'
+#!/bin/bash
+# Float delay values: node0=0ms, node1=25.5ms, node2=100.75ms
+echo "0 25.5 100.75"
+EOF
+chmod +x delay_cmd_float.sh
+
+cat > delay_cmd_high.sh << 'EOF'
+#!/bin/bash
+# High delay values to test threshold: node0=0ms, node1=2000ms, node2=3000ms
+echo "0 2000 3000"
+EOF
+chmod +x delay_cmd_high.sh
+
+# ----------------------------------------------------------------------------------------
+echo === Test1: Basic external command with integer millisecond values ===
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source = 'cmd'" >> etc/pgpool.conf
+echo "replication_delay_source_cmd = './delay_cmd_static.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run and populate delay values
+# sr_check_period is 1 second, so wait a bit longer to ensure it runs
+echo "Waiting for sr_check to run..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that delay values are populated in the log
+grep "executing replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command was not executed
+    echo "Log contents:"
+    tail -20 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+# Verify actual delay values were parsed
+if ! $PSQL -t -c "SHOW POOL_NODES" test | grep -E "[0-9]+\.[0-9]+" >/dev/null; then
+    echo "Warning: No delay values found in POOL_NODES output"
+fi
+
+# Check for delay log messages
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command delay logging not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: basic external command test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test2: External command with floating-point millisecond values ===
+# ----------------------------------------------------------------------------------------
+# Update configuration to use float command
+sed -i.bak "s|delay_cmd_static.sh|delay_cmd_float.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with float values
+echo "Waiting for sr_check with float values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log 2>/dev/null; then
+        echo "Float command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that float values are handled correctly
+grep "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: float command was not executed
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: floating-point values test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test3: External command with delay threshold ===
+# ----------------------------------------------------------------------------------------
+# Update configuration to use high delay command and set threshold
+sed -i.bak "s|delay_cmd_float.sh|delay_cmd_high.sh|" etc/pgpool.conf
+echo "delay_threshold_by_time = 1000" >> etc/pgpool.conf
+echo "backend_weight0 = 0" >> etc/pgpool.conf  # Force queries to standby normally
+echo "backend_weight2 = 0" >> etc/pgpool.conf  # Only use node 1 as standby
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and detect high delays
+echo "Waiting for sr_check with high delay values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_high.sh" log/pgpool.log 2>/dev/null; then
+        echo "High delay command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SELECT * FROM t1 LIMIT 1;
+EOF
+
+# With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
+grep "SELECT \* FROM t1 LIMIT 1.*DB node id: 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: query was not sent to primary node despite high delay
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: delay threshold test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test4: External command execution as process user ===
+# ----------------------------------------------------------------------------------------
+# Test that command runs as the current pgpool process user
+sed -i.bak "s|delay_cmd_high.sh|delay_cmd_static.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for sr_check to run as process user..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed as process user after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check that command was executed (without su wrapper)
+grep "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command was not executed as process user
+    ./shutdownall
+    exit 1
+fi
+
+# Verify no su command was used
+if grep -q "executing replication delay command.*su.*" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not use su wrapper
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: process user execution test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test5: Error handling - missing command ===
+# ----------------------------------------------------------------------------------------
+# Test error handling when command is not configured
+sed -i.bak "s|replication_delay_source_cmd = './delay_cmd_static.sh'|replication_delay_source_cmd = ''|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with missing command
+echo "Waiting for sr_check with missing command..."
+for i in {1..5}; do
+    if grep -q "replication_delay_source_cmd is not configured" log/pgpool.log 2>/dev/null; then
+        echo "Missing command error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about missing command
+grep "replication_delay_source_cmd is not configured" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: missing command error not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: error handling test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test6: Error handling - command execution failure ===
+# ----------------------------------------------------------------------------------------
+# Test error handling when command fails
+echo "replication_delay_source_cmd = './nonexistent_command.sh'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with failing command
+echo "Waiting for sr_check with failing command..."
+for i in {1..5}; do
+    if grep -q "failed to execute replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command failure detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about command execution failure
+grep "failed to execute replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command execution failure not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command failure test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test7: Command timeout handling ===
+# ----------------------------------------------------------------------------------------
+# Create a command that takes longer than the timeou
+cat > delay_cmd_slow.sh << 'EOF'
+#!/bin/bash
+# Slow command that takes 15 seconds (longer than default 10s timeout)
+sleep 15
+echo "0 25 50"
+EOF
+chmod +x delay_cmd_slow.sh
+
+# Set a short timeout and use the slow command
+sed -i.bak "s|replication_delay_source_cmd = './nonexistent_command.sh'|replication_delay_source_cmd = './delay_cmd_slow.sh'|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 3" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and timeou
+echo "Waiting for command timeout..."
+for i in {1..15}; do
+    if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+        echo "Command timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout error message
+grep "replication delay command timed out after 3 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command timeout test succeeded
+./shutdownall
+
+echo "All external replication delay tests passed!"
+exit 0
\ No newline at end of file
diff --git c/src/test/regression/tests/083.external_replication_delay/test_parsing.sh w/src/test/regression/tests/083.external_replication_delay/test_parsing.sh
new file mode 100755
index 000000000..7efdb07fb
--- /dev/null
+++ w/src/test/regression/tests/083.external_replication_delay/test_parsing.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+#-------------------------------------------------------------------
+# Unit test for external command parsing logic
+# This tests the parsing without needing a full pgpool setup
+#
+
+echo "=== Testing external command output parsing ==="
+
+# Test 1: Integer values
+echo "Test 1: Integer millisecond values"
+echo "0 25 50" > test_output.tx
+echo "Expected: 0ms, 25ms, 50ms"
+echo "Output: $(cat test_output.txt)"
+echo ""
+
+# Test 2: Float values
+echo "Test 2: Floating-point millisecond values"
+echo "0 25.5 100.75" > test_output_float.tx
+echo "Expected: 0ms, 25.5ms, 100.75ms"
+echo "Output: $(cat test_output_float.txt)"
+echo ""
+
+# Test 3: High precision float values
+echo "Test 3: High precision values"
+echo "0 0.001 999.999" > test_output_precision.tx
+echo "Expected: 0ms, 0.001ms, 999.999ms"
+echo "Output: $(cat test_output_precision.txt)"
+echo ""
+
+# Test 4: Edge case - zero values
+echo "Test 4: All zero values"
+echo "0 0 0" > test_output_zeros.tx
+echo "Expected: 0ms, 0ms, 0ms"
+echo "Output: $(cat test_output_zeros.txt)"
+echo ""
+
+# Test 5: Edge case - large values
+echo "Test 5: Large delay values"
+echo "0 5000 10000" > test_output_large.tx
+echo "Expected: 0ms, 5000ms, 10000ms"
+echo "Output: $(cat test_output_large.txt)"
+echo ""
+
+# Test 6: Mixed integer and floa
+echo "Test 6: Mixed integer and float values"
+echo "0 25 50.5" > test_output_mixed.tx
+echo "Expected: 0ms, 25ms, 50.5ms"
+echo "Output: $(cat test_output_mixed.txt)"
+echo ""
+
+# Cleanup
+rm -f test_output*.tx
+
+echo "All parsing tests completed. These outputs should be parseable by the external command feature."
\ No newline at end of file
diff --git c/src/test/regression/tests/083.external_replication_delay/test_validation.sh w/src/test/regression/tests/083.external_replication_delay/test_validation.sh
new file mode 100755
index 000000000..e14422eed
--- /dev/null
+++ w/src/test/regression/tests/083.external_replication_delay/test_validation.sh
@@ -0,0 +1,285 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command validation and edge cases
+#
+source $TESTLIBS
+TESTDIR=testdir_validation
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environmen
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create test command scripts
+cat > delay_cmd_validation.sh << 'EOF'
+#!/bin/bash
+# Test validation: output with invalid values
+echo "0 invalid_value 50.5"
+EOF
+chmod +x delay_cmd_validation.sh
+
+cat > delay_cmd_negative.sh << 'EOF'
+#!/bin/bash
+# Test negative values
+echo "0 -25 50"
+EOF
+chmod +x delay_cmd_negative.sh
+
+cat > delay_cmd_large.sh << 'EOF'
+#!/bin/bash
+# Test extremely large values
+echo "0 9999999 50"
+EOF
+chmod +x delay_cmd_large.sh
+
+cat > delay_cmd_wrong_count.sh << 'EOF'
+#!/bin/bash
+# Test wrong number of values (only 2 instead of 3)
+echo "0 25"
+EOF
+chmod +x delay_cmd_wrong_count.sh
+
+cat > delay_cmd_truncated.sh << 'EOF'
+#!/bin/bash
+# Test output that might be truncated (very long line)
+printf "0 25 "
+for i in {1..1000}; do printf "very_long_output_"; done
+echo "50"
+EOF
+chmod +x delay_cmd_truncated.sh
+
+# ----------------------------------------------------------------------------------------
+echo === Test1: Validation of invalid delay values ===
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source = 'cmd'" >> etc/pgpool.conf
+echo "replication_delay_source_cmd = './delay_cmd_validation.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run
+echo "Waiting for validation test..."
+for i in {1..10}; do
+    if grep -q "invalid delay value" log/pgpool.log 2>/dev/null; then
+        echo "Validation error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for validation warning
+grep "invalid delay value 'invalid_value' for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: validation warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: invalid value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test2: Negative delay values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_validation.sh|delay_cmd_negative.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for negative value test..."
+for i in {1..10}; do
+    if grep -q "negative delay value" log/pgpool.log 2>/dev/null; then
+        echo "Negative value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for negative value warning
+grep "negative delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: negative value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: negative value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test3: Extremely large delay values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_negative.sh|delay_cmd_large.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for large value test..."
+for i in {1..10}; do
+    if grep -q "extremely large delay value" log/pgpool.log 2>/dev/null; then
+        echo "Large value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for large value warning
+grep "extremely large delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: large value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: large value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test4: Wrong number of output values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_large.sh|delay_cmd_wrong_count.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for wrong count test..."
+for i in {1..10}; do
+    if grep -q "returned.*values, expected" log/pgpool.log 2>/dev/null; then
+        echo "Wrong count warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for wrong count warning
+grep "returned.*values, expected.*Command should output one delay value per backend node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: wrong count warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: wrong count validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test5: Primary node non-zero delay handling ===
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_primary_nonzero.sh << 'EOF'
+#!/bin/bash
+# Test primary node with non-zero delay (should be corrected to 0)
+echo "100 25 50"
+EOF
+chmod +x delay_cmd_primary_nonzero.sh
+
+sed -i.bak "s|delay_cmd_wrong_count.sh|delay_cmd_primary_nonzero.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for primary non-zero delay test..."
+for i in {1..10}; do
+    if grep -q "primary node.*reported non-zero delay" log/pgpool.log 2>/dev/null; then
+        echo "Primary non-zero delay detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for primary node correction
+grep "primary node.*reported non-zero delay.*setting to 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: primary node delay correction not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: primary node delay correction test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test6: Command timeout with different timeout values ===
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_timeout.sh << 'EOF'
+#!/bin/bash
+# Command that takes 5 seconds
+sleep 5
+echo "0 25 50"
+EOF
+chmod +x delay_cmd_timeout.sh
+
+# Test with timeout shorter than command duration
+sed -i.bak "s|delay_cmd_primary_nonzero.sh|delay_cmd_timeout.sh|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 2" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for timeou
+echo "Waiting for timeout test (2s timeout, 5s command)..."
+for i in {1..10}; do
+    if grep -q "replication delay command timed out after 2 seconds" log/pgpool.log 2>/dev/null; then
+        echo "Timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout message
+grep "replication delay command timed out after 2 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: timeout test succeeded
+./shutdownall
+
+# Test with timeout longer than command duration
+sed -i.bak "s|replication_delay_source_timeout = 2|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for successful execution
+echo "Waiting for successful execution (10s timeout, 5s command)..."
+for i in {1..15}; do
+    if grep -q "executing replication delay command.*delay_cmd_timeout.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed successfully after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should not timeout this time
+if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not have timed out with 10s timeou
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: extended timeout test succeeded
+./shutdownall
+
+echo "All validation tests passed!"
+exit 0
\ No newline at end of file


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-26 01:41  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-08-26 01:41 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

Thank you for updating the patch. I will look into that.

I have a question. Have you actually tried the patch with AWS Aurora?
I am wondering how patched pgpool works with Aurora. I am asking
because in the doc "8.5. Aurora Configuration Example":

 Set sr_check_period to 0 to disable streaming replication delay
 checking. This is because Aurora does not provide necessary functions
 to check the replication delay.

 sr_check_period = 0

So streaming replication checking is disabled, and it means that your
patch is also effectively disabled too.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Hi Tatsuo,
> 
> Thank you for the notes - please find attached an updated version.
> 
> What do you think?
> 
> Thanks,
> 
> On Mon, Aug 25, 2025 at 5:18 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> Hi Nadav,
>>
>> Thank you for the patch!
>>
>> I have one question. How do you provide a password (sr_check_password)
>> while executing replication_delay_source_cmd as sr_check_user? In my
>> understanding replication_delay_source_cmd is executed through su
>> command in your patch. In this case su command tries to read the
>> password from terminal. I don't see such a code in the patch.
>>
>> BTW, I start to think that executing replication_delay_source_cmd as
>> sr_check_user might not be a good idea. sr_check_user is a database
>> user, not OS user. In PostgreSQL they are not necessarily the
>> same. Also doing su in pgpool process needs to be very carefully to
>> avoid vulnerability. Probably we just execute it as pgpool OS user?
>>
>> Lastly when I apply the patches using git apply, there are some
>> trailing space errors.
>>
>> $ git apply ~/external-lag-feature-implementation.patch
>> /home/t-ishii/external-lag-feature-implementation.patch:314: trailing
>> whitespace.
>>
>> /home/t-ishii/external-lag-feature-implementation.patch:317: trailing
>> whitespace.
>>
>> /home/t-ishii/external-lag-feature-implementation.patch:318: trailing
>> whitespace.
>>                         cmd_len = strlen(escaped_cmd) +
>> /home/t-ishii/external-lag-feature-implementation.patch:320: trailing
>> whitespace.
>>
>> /home/t-ishii/external-lag-feature-implementation.patch:322: trailing
>> whitespace.
>>                         snprintf(full_command, cmd_len, "su - %s -c '%s'",
>> warning: squelched 4 whitespace errors
>> warning: 9 lines add whitespace errors.
>>
>> $ git apply ~/external-lag-feature-tests.patch
>> /home/t-ishii/external-lag-feature-tests.patch:87: trailing whitespace.
>> - test_parsing.sh: Unit test for parsing logic
>> /home/t-ishii/external-lag-feature-tests.patch:440: trailing whitespace.
>> # Test 2: Float values
>> warning: 2 lines add whitespace errors.
>>
>> Also I have some compilation errors after patching the source
>> code. See attached compilation log.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-08-26 06:54  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-08-26 06:54 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

I haven’t tried it yet but the whole premise of having it run a command is
that it’s not dependent on the specific DB. As you mentioned earlier.

The issue blocking the regular lag extraction from aurora is that it
doesn’t update the tables in the DB. It does have a CloudWatch API to get
the numbers tho.

You can see the metric AuroraReplicaLag under
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/metrics-reference.html

So if we have a simple command to either get it or have something else
update a file with the numbers based on it we’ll be fine.

Ordering here could get tricky since we couple the command with the
instance order.

Maybe we can expand the command to receive some arguments as to instance
order.

What do you think?


Nadav Shatz
Tailor Brands | CTO


On Tue, Aug 26, 2025 at 4:42 AM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> Thank you for updating the patch. I will look into that.
>
> I have a question. Have you actually tried the patch with AWS Aurora?
> I am wondering how patched pgpool works with Aurora. I am asking
> because in the doc "8.5. Aurora Configuration Example":
>
>  Set sr_check_period to 0 to disable streaming replication delay
>  checking. This is because Aurora does not provide necessary functions
>  to check the replication delay.
>
>  sr_check_period = 0
>
> So streaming replication checking is disabled, and it means that your
> patch is also effectively disabled too.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Hi Tatsuo,
> >
> > Thank you for the notes - please find attached an updated version.
> >
> > What do you think?
> >
> > Thanks,
> >
> > On Mon, Aug 25, 2025 at 5:18 AM Tatsuo Ishii <[email protected]>
> wrote:
> >
> >> Hi Nadav,
> >>
> >> Thank you for the patch!
> >>
> >> I have one question. How do you provide a password (sr_check_password)
> >> while executing replication_delay_source_cmd as sr_check_user? In my
> >> understanding replication_delay_source_cmd is executed through su
> >> command in your patch. In this case su command tries to read the
> >> password from terminal. I don't see such a code in the patch.
> >>
> >> BTW, I start to think that executing replication_delay_source_cmd as
> >> sr_check_user might not be a good idea. sr_check_user is a database
> >> user, not OS user. In PostgreSQL they are not necessarily the
> >> same. Also doing su in pgpool process needs to be very carefully to
> >> avoid vulnerability. Probably we just execute it as pgpool OS user?
> >>
> >> Lastly when I apply the patches using git apply, there are some
> >> trailing space errors.
> >>
> >> $ git apply ~/external-lag-feature-implementation.patch
> >> /home/t-ishii/external-lag-feature-implementation.patch:314: trailing
> >> whitespace.
> >>
> >> /home/t-ishii/external-lag-feature-implementation.patch:317: trailing
> >> whitespace.
> >>
> >> /home/t-ishii/external-lag-feature-implementation.patch:318: trailing
> >> whitespace.
> >>                         cmd_len = strlen(escaped_cmd) +
> >> /home/t-ishii/external-lag-feature-implementation.patch:320: trailing
> >> whitespace.
> >>
> >> /home/t-ishii/external-lag-feature-implementation.patch:322: trailing
> >> whitespace.
> >>                         snprintf(full_command, cmd_len, "su - %s -c
> '%s'",
> >> warning: squelched 4 whitespace errors
> >> warning: 9 lines add whitespace errors.
> >>
> >> $ git apply ~/external-lag-feature-tests.patch
> >> /home/t-ishii/external-lag-feature-tests.patch:87: trailing whitespace.
> >> - test_parsing.sh: Unit test for parsing logic
> >> /home/t-ishii/external-lag-feature-tests.patch:440: trailing whitespace.
> >> # Test 2: Float values
> >> warning: 2 lines add whitespace errors.
> >>
> >> Also I have some compilation errors after patching the source
> >> code. See attached compilation log.
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-01 13:34  Nadav Shatz <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-09-01 13:34 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

I don't want to rush at all - did you get a chance to look at what I sent?
Can I share more relevant information with you?

What do you think?

On Tue, Aug 26, 2025 at 9:54 AM Nadav Shatz <[email protected]> wrote:

> Hi Tatsuo,
>
> I haven’t tried it yet but the whole premise of having it run a command is
> that it’s not dependent on the specific DB. As you mentioned earlier.
>
> The issue blocking the regular lag extraction from aurora is that it
> doesn’t update the tables in the DB. It does have a CloudWatch API to get
> the numbers tho.
>
> You can see the metric AuroraReplicaLag under
>
> https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/metrics-reference.html
>
> So if we have a simple command to either get it or have something else
> update a file with the numbers based on it we’ll be fine.
>
> Ordering here could get tricky since we couple the command with the
> instance order.
>
> Maybe we can expand the command to receive some arguments as to instance
> order.
>
> What do you think?
>
>
> Nadav Shatz
> Tailor Brands | CTO
>
>
> On Tue, Aug 26, 2025 at 4:42 AM Tatsuo Ishii <[email protected]> wrote:
>
>> Hi Nadav,
>>
>> Thank you for updating the patch. I will look into that.
>>
>> I have a question. Have you actually tried the patch with AWS Aurora?
>> I am wondering how patched pgpool works with Aurora. I am asking
>> because in the doc "8.5. Aurora Configuration Example":
>>
>>  Set sr_check_period to 0 to disable streaming replication delay
>>  checking. This is because Aurora does not provide necessary functions
>>  to check the replication delay.
>>
>>  sr_check_period = 0
>>
>> So streaming replication checking is disabled, and it means that your
>> patch is also effectively disabled too.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> > Hi Tatsuo,
>> >
>> > Thank you for the notes - please find attached an updated version.
>> >
>> > What do you think?
>> >
>> > Thanks,
>> >
>> > On Mon, Aug 25, 2025 at 5:18 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> Hi Nadav,
>> >>
>> >> Thank you for the patch!
>> >>
>> >> I have one question. How do you provide a password (sr_check_password)
>> >> while executing replication_delay_source_cmd as sr_check_user? In my
>> >> understanding replication_delay_source_cmd is executed through su
>> >> command in your patch. In this case su command tries to read the
>> >> password from terminal. I don't see such a code in the patch.
>> >>
>> >> BTW, I start to think that executing replication_delay_source_cmd as
>> >> sr_check_user might not be a good idea. sr_check_user is a database
>> >> user, not OS user. In PostgreSQL they are not necessarily the
>> >> same. Also doing su in pgpool process needs to be very carefully to
>> >> avoid vulnerability. Probably we just execute it as pgpool OS user?
>> >>
>> >> Lastly when I apply the patches using git apply, there are some
>> >> trailing space errors.
>> >>
>> >> $ git apply ~/external-lag-feature-implementation.patch
>> >> /home/t-ishii/external-lag-feature-implementation.patch:314: trailing
>> >> whitespace.
>> >>
>> >> /home/t-ishii/external-lag-feature-implementation.patch:317: trailing
>> >> whitespace.
>> >>
>> >> /home/t-ishii/external-lag-feature-implementation.patch:318: trailing
>> >> whitespace.
>> >>                         cmd_len = strlen(escaped_cmd) +
>> >> /home/t-ishii/external-lag-feature-implementation.patch:320: trailing
>> >> whitespace.
>> >>
>> >> /home/t-ishii/external-lag-feature-implementation.patch:322: trailing
>> >> whitespace.
>> >>                         snprintf(full_command, cmd_len, "su - %s -c
>> '%s'",
>> >> warning: squelched 4 whitespace errors
>> >> warning: 9 lines add whitespace errors.
>> >>
>> >> $ git apply ~/external-lag-feature-tests.patch
>> >> /home/t-ishii/external-lag-feature-tests.patch:87: trailing whitespace.
>> >> - test_parsing.sh: Unit test for parsing logic
>> >> /home/t-ishii/external-lag-feature-tests.patch:440: trailing
>> whitespace.
>> >> # Test 2: Float values
>> >> warning: 2 lines add whitespace errors.
>> >>
>> >> Also I have some compilation errors after patching the source
>> >> code. See attached compilation log.
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
>

-- 
Nadav Shatz
Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-01 22:41  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-09-01 22:41 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

Sorry for late reply.

>> I haven’t tried it yet but the whole premise of having it run a command is
>> that it’s not dependent on the specific DB. As you mentioned earlier.
>>
>> The issue blocking the regular lag extraction from aurora is that it
>> doesn’t update the tables in the DB. It does have a CloudWatch API to get
>> the numbers tho.

I am not familiar with CloudWatch API and am not sure I fully
understand you issue. What is your issue with CloudWatch API?  Is it a
technical problem, or some cost issue (I guess CloudWatch is a paid
service)?

>> Ordering here could get tricky since we couple the command with the
>> instance order.
>>
>> Maybe we can expand the command to receive some arguments as to instance
>> order.

Can you elaborate what "ordering" is?

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Hi Tatsuo,
> 
> I don't want to rush at all - did you get a chance to look at what I sent?
> Can I share more relevant information with you?
> 
> What do you think?
> 
> On Tue, Aug 26, 2025 at 9:54 AM Nadav Shatz <[email protected]> wrote:
> 
>> Hi Tatsuo,
>>
>> I haven’t tried it yet but the whole premise of having it run a command is
>> that it’s not dependent on the specific DB. As you mentioned earlier.
>>
>> The issue blocking the regular lag extraction from aurora is that it
>> doesn’t update the tables in the DB. It does have a CloudWatch API to get
>> the numbers tho.
>>
>> You can see the metric AuroraReplicaLag under
>>
>> https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/metrics-reference.html
>>
>> So if we have a simple command to either get it or have something else
>> update a file with the numbers based on it we’ll be fine.
>>
>> Ordering here could get tricky since we couple the command with the
>> instance order.
>>
>> Maybe we can expand the command to receive some arguments as to instance
>> order.
>>
>> What do you think?
>>
>>
>> Nadav Shatz
>> Tailor Brands | CTO
>>
>>
>> On Tue, Aug 26, 2025 at 4:42 AM Tatsuo Ishii <[email protected]> wrote:
>>
>>> Hi Nadav,
>>>
>>> Thank you for updating the patch. I will look into that.
>>>
>>> I have a question. Have you actually tried the patch with AWS Aurora?
>>> I am wondering how patched pgpool works with Aurora. I am asking
>>> because in the doc "8.5. Aurora Configuration Example":
>>>
>>>  Set sr_check_period to 0 to disable streaming replication delay
>>>  checking. This is because Aurora does not provide necessary functions
>>>  to check the replication delay.
>>>
>>>  sr_check_period = 0
>>>
>>> So streaming replication checking is disabled, and it means that your
>>> patch is also effectively disabled too.
>>>
>>> Best regards,
>>> --
>>> Tatsuo Ishii
>>> SRA OSS K.K.
>>> English: http://www.sraoss.co.jp/index_en/
>>> Japanese:http://www.sraoss.co.jp
>>>
>>> > Hi Tatsuo,
>>> >
>>> > Thank you for the notes - please find attached an updated version.
>>> >
>>> > What do you think?
>>> >
>>> > Thanks,
>>> >
>>> > On Mon, Aug 25, 2025 at 5:18 AM Tatsuo Ishii <[email protected]>
>>> wrote:
>>> >
>>> >> Hi Nadav,
>>> >>
>>> >> Thank you for the patch!
>>> >>
>>> >> I have one question. How do you provide a password (sr_check_password)
>>> >> while executing replication_delay_source_cmd as sr_check_user? In my
>>> >> understanding replication_delay_source_cmd is executed through su
>>> >> command in your patch. In this case su command tries to read the
>>> >> password from terminal. I don't see such a code in the patch.
>>> >>
>>> >> BTW, I start to think that executing replication_delay_source_cmd as
>>> >> sr_check_user might not be a good idea. sr_check_user is a database
>>> >> user, not OS user. In PostgreSQL they are not necessarily the
>>> >> same. Also doing su in pgpool process needs to be very carefully to
>>> >> avoid vulnerability. Probably we just execute it as pgpool OS user?
>>> >>
>>> >> Lastly when I apply the patches using git apply, there are some
>>> >> trailing space errors.
>>> >>
>>> >> $ git apply ~/external-lag-feature-implementation.patch
>>> >> /home/t-ishii/external-lag-feature-implementation.patch:314: trailing
>>> >> whitespace.
>>> >>
>>> >> /home/t-ishii/external-lag-feature-implementation.patch:317: trailing
>>> >> whitespace.
>>> >>
>>> >> /home/t-ishii/external-lag-feature-implementation.patch:318: trailing
>>> >> whitespace.
>>> >>                         cmd_len = strlen(escaped_cmd) +
>>> >> /home/t-ishii/external-lag-feature-implementation.patch:320: trailing
>>> >> whitespace.
>>> >>
>>> >> /home/t-ishii/external-lag-feature-implementation.patch:322: trailing
>>> >> whitespace.
>>> >>                         snprintf(full_command, cmd_len, "su - %s -c
>>> '%s'",
>>> >> warning: squelched 4 whitespace errors
>>> >> warning: 9 lines add whitespace errors.
>>> >>
>>> >> $ git apply ~/external-lag-feature-tests.patch
>>> >> /home/t-ishii/external-lag-feature-tests.patch:87: trailing whitespace.
>>> >> - test_parsing.sh: Unit test for parsing logic
>>> >> /home/t-ishii/external-lag-feature-tests.patch:440: trailing
>>> whitespace.
>>> >> # Test 2: Float values
>>> >> warning: 2 lines add whitespace errors.
>>> >>
>>> >> Also I have some compilation errors after patching the source
>>> >> code. See attached compilation log.
>>> >>
>>> >> Best regards,
>>> >> --
>>> >> Tatsuo Ishii
>>> >> SRA OSS K.K.
>>> >> English: http://www.sraoss.co.jp/index_en/
>>> >> Japanese:http://www.sraoss.co.jp
>>> >>
>>> >
>>> >
>>> > --
>>> > Nadav Shatz
>>> > Tailor Brands | CTO
>>>
>>
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-02 08:32  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-09-02 08:32 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi,

All good.

The usual way the Pgpool accesses the lag params is through the relevant
tables in the DB. for aurora that isn't available.
The numbers are available directly from AWS API calls tho. This solution
will work with Aurora by circumventing this issue.

What i mentioned as a concern is that since the command doesn't currently
accept the actual DB instance list (primary/replicas) and their order it
can't guarantee it'll return the lag values in the expected order.

Except the primary being the first, how will the running command know the
order in which pgpool has loaded the replicas into it's
memory/configuration?

Hope this makes more sense - if not let me know and i'll provide some
examples.

Thanks,

On Tue, Sep 2, 2025 at 1:41 AM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> Sorry for late reply.
>
> >> I haven’t tried it yet but the whole premise of having it run a command
> is
> >> that it’s not dependent on the specific DB. As you mentioned earlier.
> >>
> >> The issue blocking the regular lag extraction from aurora is that it
> >> doesn’t update the tables in the DB. It does have a CloudWatch API to
> get
> >> the numbers tho.
>
> I am not familiar with CloudWatch API and am not sure I fully
> understand you issue. What is your issue with CloudWatch API?  Is it a
> technical problem, or some cost issue (I guess CloudWatch is a paid
> service)?
>
> >> Ordering here could get tricky since we couple the command with the
> >> instance order.
> >>
> >> Maybe we can expand the command to receive some arguments as to instance
> >> order.
>
> Can you elaborate what "ordering" is?
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Hi Tatsuo,
> >
> > I don't want to rush at all - did you get a chance to look at what I
> sent?
> > Can I share more relevant information with you?
> >
> > What do you think?
> >
> > On Tue, Aug 26, 2025 at 9:54 AM Nadav Shatz <[email protected]>
> wrote:
> >
> >> Hi Tatsuo,
> >>
> >> I haven’t tried it yet but the whole premise of having it run a command
> is
> >> that it’s not dependent on the specific DB. As you mentioned earlier.
> >>
> >> The issue blocking the regular lag extraction from aurora is that it
> >> doesn’t update the tables in the DB. It does have a CloudWatch API to
> get
> >> the numbers tho.
> >>
> >> You can see the metric AuroraReplicaLag under
> >>
> >>
> https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/metrics-reference.html
> >>
> >> So if we have a simple command to either get it or have something else
> >> update a file with the numbers based on it we’ll be fine.
> >>
> >> Ordering here could get tricky since we couple the command with the
> >> instance order.
> >>
> >> Maybe we can expand the command to receive some arguments as to instance
> >> order.
> >>
> >> What do you think?
> >>
> >>
> >> Nadav Shatz
> >> Tailor Brands | CTO
> >>
> >>
> >> On Tue, Aug 26, 2025 at 4:42 AM Tatsuo Ishii <[email protected]>
> wrote:
> >>
> >>> Hi Nadav,
> >>>
> >>> Thank you for updating the patch. I will look into that.
> >>>
> >>> I have a question. Have you actually tried the patch with AWS Aurora?
> >>> I am wondering how patched pgpool works with Aurora. I am asking
> >>> because in the doc "8.5. Aurora Configuration Example":
> >>>
> >>>  Set sr_check_period to 0 to disable streaming replication delay
> >>>  checking. This is because Aurora does not provide necessary functions
> >>>  to check the replication delay.
> >>>
> >>>  sr_check_period = 0
> >>>
> >>> So streaming replication checking is disabled, and it means that your
> >>> patch is also effectively disabled too.
> >>>
> >>> Best regards,
> >>> --
> >>> Tatsuo Ishii
> >>> SRA OSS K.K.
> >>> English: http://www.sraoss.co.jp/index_en/
> >>> Japanese:http://www.sraoss.co.jp
> >>>
> >>> > Hi Tatsuo,
> >>> >
> >>> > Thank you for the notes - please find attached an updated version.
> >>> >
> >>> > What do you think?
> >>> >
> >>> > Thanks,
> >>> >
> >>> > On Mon, Aug 25, 2025 at 5:18 AM Tatsuo Ishii <[email protected]>
> >>> wrote:
> >>> >
> >>> >> Hi Nadav,
> >>> >>
> >>> >> Thank you for the patch!
> >>> >>
> >>> >> I have one question. How do you provide a password
> (sr_check_password)
> >>> >> while executing replication_delay_source_cmd as sr_check_user? In my
> >>> >> understanding replication_delay_source_cmd is executed through su
> >>> >> command in your patch. In this case su command tries to read the
> >>> >> password from terminal. I don't see such a code in the patch.
> >>> >>
> >>> >> BTW, I start to think that executing replication_delay_source_cmd as
> >>> >> sr_check_user might not be a good idea. sr_check_user is a database
> >>> >> user, not OS user. In PostgreSQL they are not necessarily the
> >>> >> same. Also doing su in pgpool process needs to be very carefully to
> >>> >> avoid vulnerability. Probably we just execute it as pgpool OS user?
> >>> >>
> >>> >> Lastly when I apply the patches using git apply, there are some
> >>> >> trailing space errors.
> >>> >>
> >>> >> $ git apply ~/external-lag-feature-implementation.patch
> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:314:
> trailing
> >>> >> whitespace.
> >>> >>
> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:317:
> trailing
> >>> >> whitespace.
> >>> >>
> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:318:
> trailing
> >>> >> whitespace.
> >>> >>                         cmd_len = strlen(escaped_cmd) +
> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:320:
> trailing
> >>> >> whitespace.
> >>> >>
> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:322:
> trailing
> >>> >> whitespace.
> >>> >>                         snprintf(full_command, cmd_len, "su - %s -c
> >>> '%s'",
> >>> >> warning: squelched 4 whitespace errors
> >>> >> warning: 9 lines add whitespace errors.
> >>> >>
> >>> >> $ git apply ~/external-lag-feature-tests.patch
> >>> >> /home/t-ishii/external-lag-feature-tests.patch:87: trailing
> whitespace.
> >>> >> - test_parsing.sh: Unit test for parsing logic
> >>> >> /home/t-ishii/external-lag-feature-tests.patch:440: trailing
> >>> whitespace.
> >>> >> # Test 2: Float values
> >>> >> warning: 2 lines add whitespace errors.
> >>> >>
> >>> >> Also I have some compilation errors after patching the source
> >>> >> code. See attached compilation log.
> >>> >>
> >>> >> Best regards,
> >>> >> --
> >>> >> Tatsuo Ishii
> >>> >> SRA OSS K.K.
> >>> >> English: http://www.sraoss.co.jp/index_en/
> >>> >> Japanese:http://www.sraoss.co.jp
> >>> >>
> >>> >
> >>> >
> >>> > --
> >>> > Nadav Shatz
> >>> > Tailor Brands | CTO
> >>>
> >>
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>


-- 
Nadav Shatz
Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-03 23:36  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-09-03 23:36 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi,

> Hi,
> 
> All good.
> 
> The usual way the Pgpool accesses the lag params is through the relevant
> tables in the DB. for aurora that isn't available.
> The numbers are available directly from AWS API calls tho. This solution
> will work with Aurora by circumventing this issue.
> 
> What i mentioned as a concern is that since the command doesn't currently
> accept the actual DB instance list (primary/replicas) and their order it
> can't guarantee it'll return the lag values in the expected order.
> 
> Except the primary being the first, how will the running command know the
> order in which pgpool has loaded the replicas into it's
> memory/configuration?
> 
> Hope this makes more sense - if not let me know and i'll provide some
> examples.

Yes, examples would be helpful.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Thanks,
> 
> On Tue, Sep 2, 2025 at 1:41 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> Hi Nadav,
>>
>> Sorry for late reply.
>>
>> >> I haven’t tried it yet but the whole premise of having it run a command
>> is
>> >> that it’s not dependent on the specific DB. As you mentioned earlier.
>> >>
>> >> The issue blocking the regular lag extraction from aurora is that it
>> >> doesn’t update the tables in the DB. It does have a CloudWatch API to
>> get
>> >> the numbers tho.
>>
>> I am not familiar with CloudWatch API and am not sure I fully
>> understand you issue. What is your issue with CloudWatch API?  Is it a
>> technical problem, or some cost issue (I guess CloudWatch is a paid
>> service)?
>>
>> >> Ordering here could get tricky since we couple the command with the
>> >> instance order.
>> >>
>> >> Maybe we can expand the command to receive some arguments as to instance
>> >> order.
>>
>> Can you elaborate what "ordering" is?
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> > Hi Tatsuo,
>> >
>> > I don't want to rush at all - did you get a chance to look at what I
>> sent?
>> > Can I share more relevant information with you?
>> >
>> > What do you think?
>> >
>> > On Tue, Aug 26, 2025 at 9:54 AM Nadav Shatz <[email protected]>
>> wrote:
>> >
>> >> Hi Tatsuo,
>> >>
>> >> I haven’t tried it yet but the whole premise of having it run a command
>> is
>> >> that it’s not dependent on the specific DB. As you mentioned earlier.
>> >>
>> >> The issue blocking the regular lag extraction from aurora is that it
>> >> doesn’t update the tables in the DB. It does have a CloudWatch API to
>> get
>> >> the numbers tho.
>> >>
>> >> You can see the metric AuroraReplicaLag under
>> >>
>> >>
>> https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/metrics-reference.html
>> >>
>> >> So if we have a simple command to either get it or have something else
>> >> update a file with the numbers based on it we’ll be fine.
>> >>
>> >> Ordering here could get tricky since we couple the command with the
>> >> instance order.
>> >>
>> >> Maybe we can expand the command to receive some arguments as to instance
>> >> order.
>> >>
>> >> What do you think?
>> >>
>> >>
>> >> Nadav Shatz
>> >> Tailor Brands | CTO
>> >>
>> >>
>> >> On Tue, Aug 26, 2025 at 4:42 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >>
>> >>> Hi Nadav,
>> >>>
>> >>> Thank you for updating the patch. I will look into that.
>> >>>
>> >>> I have a question. Have you actually tried the patch with AWS Aurora?
>> >>> I am wondering how patched pgpool works with Aurora. I am asking
>> >>> because in the doc "8.5. Aurora Configuration Example":
>> >>>
>> >>>  Set sr_check_period to 0 to disable streaming replication delay
>> >>>  checking. This is because Aurora does not provide necessary functions
>> >>>  to check the replication delay.
>> >>>
>> >>>  sr_check_period = 0
>> >>>
>> >>> So streaming replication checking is disabled, and it means that your
>> >>> patch is also effectively disabled too.
>> >>>
>> >>> Best regards,
>> >>> --
>> >>> Tatsuo Ishii
>> >>> SRA OSS K.K.
>> >>> English: http://www.sraoss.co.jp/index_en/
>> >>> Japanese:http://www.sraoss.co.jp
>> >>>
>> >>> > Hi Tatsuo,
>> >>> >
>> >>> > Thank you for the notes - please find attached an updated version.
>> >>> >
>> >>> > What do you think?
>> >>> >
>> >>> > Thanks,
>> >>> >
>> >>> > On Mon, Aug 25, 2025 at 5:18 AM Tatsuo Ishii <[email protected]>
>> >>> wrote:
>> >>> >
>> >>> >> Hi Nadav,
>> >>> >>
>> >>> >> Thank you for the patch!
>> >>> >>
>> >>> >> I have one question. How do you provide a password
>> (sr_check_password)
>> >>> >> while executing replication_delay_source_cmd as sr_check_user? In my
>> >>> >> understanding replication_delay_source_cmd is executed through su
>> >>> >> command in your patch. In this case su command tries to read the
>> >>> >> password from terminal. I don't see such a code in the patch.
>> >>> >>
>> >>> >> BTW, I start to think that executing replication_delay_source_cmd as
>> >>> >> sr_check_user might not be a good idea. sr_check_user is a database
>> >>> >> user, not OS user. In PostgreSQL they are not necessarily the
>> >>> >> same. Also doing su in pgpool process needs to be very carefully to
>> >>> >> avoid vulnerability. Probably we just execute it as pgpool OS user?
>> >>> >>
>> >>> >> Lastly when I apply the patches using git apply, there are some
>> >>> >> trailing space errors.
>> >>> >>
>> >>> >> $ git apply ~/external-lag-feature-implementation.patch
>> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:314:
>> trailing
>> >>> >> whitespace.
>> >>> >>
>> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:317:
>> trailing
>> >>> >> whitespace.
>> >>> >>
>> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:318:
>> trailing
>> >>> >> whitespace.
>> >>> >>                         cmd_len = strlen(escaped_cmd) +
>> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:320:
>> trailing
>> >>> >> whitespace.
>> >>> >>
>> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:322:
>> trailing
>> >>> >> whitespace.
>> >>> >>                         snprintf(full_command, cmd_len, "su - %s -c
>> >>> '%s'",
>> >>> >> warning: squelched 4 whitespace errors
>> >>> >> warning: 9 lines add whitespace errors.
>> >>> >>
>> >>> >> $ git apply ~/external-lag-feature-tests.patch
>> >>> >> /home/t-ishii/external-lag-feature-tests.patch:87: trailing
>> whitespace.
>> >>> >> - test_parsing.sh: Unit test for parsing logic
>> >>> >> /home/t-ishii/external-lag-feature-tests.patch:440: trailing
>> >>> whitespace.
>> >>> >> # Test 2: Float values
>> >>> >> warning: 2 lines add whitespace errors.
>> >>> >>
>> >>> >> Also I have some compilation errors after patching the source
>> >>> >> code. See attached compilation log.
>> >>> >>
>> >>> >> Best regards,
>> >>> >> --
>> >>> >> Tatsuo Ishii
>> >>> >> SRA OSS K.K.
>> >>> >> English: http://www.sraoss.co.jp/index_en/
>> >>> >> Japanese:http://www.sraoss.co.jp
>> >>> >>
>> >>> >
>> >>> >
>> >>> > --
>> >>> > Nadav Shatz
>> >>> > Tailor Brands | CTO
>> >>>
>> >>
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-07 08:52  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-09-07 08:52 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

Thanks for getting back to me. Let me clarify the ordering concern and
provide an example to make it clearer:

Currently, replication_delay_source_cmd executes without awareness of the
replica list or the order in which Pgpool loads them. For Aurora, since
we’re bypassing the internal DB tables and fetching lag data directly via
the AWS CloudWatch API, we need to ensure the returned lag values are
mapped to the correct instances.

For example, assume Pgpool has the following configuration:

primary: db-primary
replicas: db-replica-a, db-replica-b, db-replica-c

If the command retrieves lag values [15, 120, 60] from CloudWatch, we need
to guarantee these are consistently mapped as:


   -

   db-replica-a → 15ms
   -

   db-replica-b → 120ms
   -

   db-replica-c → 60ms

Without explicitly passing the instance identifiers and their order to the
command, there’s a risk that mismatched ordering will cause Pgpool to make
incorrect routing decisions.

To address this, I suggest extending replication_delay_source_cmd to accept
an ordered list of instance identifiers as arguments. This way, the command
can fetch the metrics in the same sequence Pgpool expects, ensuring
alignment between configuration and returned data.

Would you agree this approach makes sense? If so, I can provide an updated
patch to demonstrate how the command would handle ordered instance mapping.


Best regards,

On Thu, Sep 4, 2025 at 2:36 AM Tatsuo Ishii <[email protected]> wrote:

> Hi,
>
> > Hi,
> >
> > All good.
> >
> > The usual way the Pgpool accesses the lag params is through the relevant
> > tables in the DB. for aurora that isn't available.
> > The numbers are available directly from AWS API calls tho. This solution
> > will work with Aurora by circumventing this issue.
> >
> > What i mentioned as a concern is that since the command doesn't currently
> > accept the actual DB instance list (primary/replicas) and their order it
> > can't guarantee it'll return the lag values in the expected order.
> >
> > Except the primary being the first, how will the running command know the
> > order in which pgpool has loaded the replicas into it's
> > memory/configuration?
> >
> > Hope this makes more sense - if not let me know and i'll provide some
> > examples.
>
> Yes, examples would be helpful.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Thanks,
> >
> > On Tue, Sep 2, 2025 at 1:41 AM Tatsuo Ishii <[email protected]>
> wrote:
> >
> >> Hi Nadav,
> >>
> >> Sorry for late reply.
> >>
> >> >> I haven’t tried it yet but the whole premise of having it run a
> command
> >> is
> >> >> that it’s not dependent on the specific DB. As you mentioned earlier.
> >> >>
> >> >> The issue blocking the regular lag extraction from aurora is that it
> >> >> doesn’t update the tables in the DB. It does have a CloudWatch API to
> >> get
> >> >> the numbers tho.
> >>
> >> I am not familiar with CloudWatch API and am not sure I fully
> >> understand you issue. What is your issue with CloudWatch API?  Is it a
> >> technical problem, or some cost issue (I guess CloudWatch is a paid
> >> service)?
> >>
> >> >> Ordering here could get tricky since we couple the command with the
> >> >> instance order.
> >> >>
> >> >> Maybe we can expand the command to receive some arguments as to
> instance
> >> >> order.
> >>
> >> Can you elaborate what "ordering" is?
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
> >> > Hi Tatsuo,
> >> >
> >> > I don't want to rush at all - did you get a chance to look at what I
> >> sent?
> >> > Can I share more relevant information with you?
> >> >
> >> > What do you think?
> >> >
> >> > On Tue, Aug 26, 2025 at 9:54 AM Nadav Shatz <[email protected]>
> >> wrote:
> >> >
> >> >> Hi Tatsuo,
> >> >>
> >> >> I haven’t tried it yet but the whole premise of having it run a
> command
> >> is
> >> >> that it’s not dependent on the specific DB. As you mentioned earlier.
> >> >>
> >> >> The issue blocking the regular lag extraction from aurora is that it
> >> >> doesn’t update the tables in the DB. It does have a CloudWatch API to
> >> get
> >> >> the numbers tho.
> >> >>
> >> >> You can see the metric AuroraReplicaLag under
> >> >>
> >> >>
> >>
> https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/metrics-reference.html
> >> >>
> >> >> So if we have a simple command to either get it or have something
> else
> >> >> update a file with the numbers based on it we’ll be fine.
> >> >>
> >> >> Ordering here could get tricky since we couple the command with the
> >> >> instance order.
> >> >>
> >> >> Maybe we can expand the command to receive some arguments as to
> instance
> >> >> order.
> >> >>
> >> >> What do you think?
> >> >>
> >> >>
> >> >> Nadav Shatz
> >> >> Tailor Brands | CTO
> >> >>
> >> >>
> >> >> On Tue, Aug 26, 2025 at 4:42 AM Tatsuo Ishii <[email protected]>
> >> wrote:
> >> >>
> >> >>> Hi Nadav,
> >> >>>
> >> >>> Thank you for updating the patch. I will look into that.
> >> >>>
> >> >>> I have a question. Have you actually tried the patch with AWS
> Aurora?
> >> >>> I am wondering how patched pgpool works with Aurora. I am asking
> >> >>> because in the doc "8.5. Aurora Configuration Example":
> >> >>>
> >> >>>  Set sr_check_period to 0 to disable streaming replication delay
> >> >>>  checking. This is because Aurora does not provide necessary
> functions
> >> >>>  to check the replication delay.
> >> >>>
> >> >>>  sr_check_period = 0
> >> >>>
> >> >>> So streaming replication checking is disabled, and it means that
> your
> >> >>> patch is also effectively disabled too.
> >> >>>
> >> >>> Best regards,
> >> >>> --
> >> >>> Tatsuo Ishii
> >> >>> SRA OSS K.K.
> >> >>> English: http://www.sraoss.co.jp/index_en/
> >> >>> Japanese:http://www.sraoss.co.jp
> >> >>>
> >> >>> > Hi Tatsuo,
> >> >>> >
> >> >>> > Thank you for the notes - please find attached an updated version.
> >> >>> >
> >> >>> > What do you think?
> >> >>> >
> >> >>> > Thanks,
> >> >>> >
> >> >>> > On Mon, Aug 25, 2025 at 5:18 AM Tatsuo Ishii <
> [email protected]>
> >> >>> wrote:
> >> >>> >
> >> >>> >> Hi Nadav,
> >> >>> >>
> >> >>> >> Thank you for the patch!
> >> >>> >>
> >> >>> >> I have one question. How do you provide a password
> >> (sr_check_password)
> >> >>> >> while executing replication_delay_source_cmd as sr_check_user?
> In my
> >> >>> >> understanding replication_delay_source_cmd is executed through su
> >> >>> >> command in your patch. In this case su command tries to read the
> >> >>> >> password from terminal. I don't see such a code in the patch.
> >> >>> >>
> >> >>> >> BTW, I start to think that executing
> replication_delay_source_cmd as
> >> >>> >> sr_check_user might not be a good idea. sr_check_user is a
> database
> >> >>> >> user, not OS user. In PostgreSQL they are not necessarily the
> >> >>> >> same. Also doing su in pgpool process needs to be very carefully
> to
> >> >>> >> avoid vulnerability. Probably we just execute it as pgpool OS
> user?
> >> >>> >>
> >> >>> >> Lastly when I apply the patches using git apply, there are some
> >> >>> >> trailing space errors.
> >> >>> >>
> >> >>> >> $ git apply ~/external-lag-feature-implementation.patch
> >> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:314:
> >> trailing
> >> >>> >> whitespace.
> >> >>> >>
> >> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:317:
> >> trailing
> >> >>> >> whitespace.
> >> >>> >>
> >> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:318:
> >> trailing
> >> >>> >> whitespace.
> >> >>> >>                         cmd_len = strlen(escaped_cmd) +
> >> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:320:
> >> trailing
> >> >>> >> whitespace.
> >> >>> >>
> >> >>> >> /home/t-ishii/external-lag-feature-implementation.patch:322:
> >> trailing
> >> >>> >> whitespace.
> >> >>> >>                         snprintf(full_command, cmd_len, "su - %s
> -c
> >> >>> '%s'",
> >> >>> >> warning: squelched 4 whitespace errors
> >> >>> >> warning: 9 lines add whitespace errors.
> >> >>> >>
> >> >>> >> $ git apply ~/external-lag-feature-tests.patch
> >> >>> >> /home/t-ishii/external-lag-feature-tests.patch:87: trailing
> >> whitespace.
> >> >>> >> - test_parsing.sh: Unit test for parsing logic
> >> >>> >> /home/t-ishii/external-lag-feature-tests.patch:440: trailing
> >> >>> whitespace.
> >> >>> >> # Test 2: Float values
> >> >>> >> warning: 2 lines add whitespace errors.
> >> >>> >>
> >> >>> >> Also I have some compilation errors after patching the source
> >> >>> >> code. See attached compilation log.
> >> >>> >>
> >> >>> >> Best regards,
> >> >>> >> --
> >> >>> >> Tatsuo Ishii
> >> >>> >> SRA OSS K.K.
> >> >>> >> English: http://www.sraoss.co.jp/index_en/
> >> >>> >> Japanese:http://www.sraoss.co.jp
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >>> > --
> >> >>> > Nadav Shatz
> >> >>> > Tailor Brands | CTO
> >> >>>
> >> >>
> >> >
> >> > --
> >> > Nadav Shatz
> >> > Tailor Brands | CTO
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>


-- 
Nadav Shatz
Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-08 00:26  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-09-08 00:26 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]; [email protected]

Hi Nadav,

> Hi Tatsuo,
> 
> Thanks for getting back to me. Let me clarify the ordering concern and
> provide an example to make it clearer:
> 
> Currently, replication_delay_source_cmd executes without awareness of the
> replica list or the order in which Pgpool loads them. For Aurora, since
> we’re bypassing the internal DB tables and fetching lag data directly via
> the AWS CloudWatch API, we need to ensure the returned lag values are
> mapped to the correct instances.
> 
> For example, assume Pgpool has the following configuration:
> 
> primary: db-primary
> replicas: db-replica-a, db-replica-b, db-replica-c
> 
> If the command retrieves lag values [15, 120, 60] from CloudWatch, we need
> to guarantee these are consistently mapped as:
> 
> 
>    -
> 
>    db-replica-a → 15ms
>    -
> 
>    db-replica-b → 120ms
>    -
> 
>    db-replica-c → 60ms
> 
> Without explicitly passing the instance identifiers and their order to the
> command, there’s a risk that mismatched ordering will cause Pgpool to make
> incorrect routing decisions.
> 
> To address this, I suggest extending replication_delay_source_cmd to accept
> an ordered list of instance identifiers as arguments. This way, the command
> can fetch the metrics in the same sequence Pgpool expects, ensuring
> alignment between configuration and returned data.

Thanks for the clarification. Previously I misunderstood that Aurora
only provides "reader endpoint", which made me think your proposal to
be impossible. But after some research , I found that Aurora also
provides "cluster endpoint" which refers to each replica instance.  So
let me check if my understanding is
correct. replication_delay_source_cmd will be invoked as:

replication_delay_source_cmd db-replica-a db-replica-b db-replica-c

> Would you agree this approach makes sense?

Yes.

> If so, I can provide an updated
> patch to demonstrate how the command would handle ordered instance mapping.

Thanks. That would be good.

BTW, There are minor points regarding your previous patch. In the patch

083.external_replication_delay/

is the test directory. This does not fit in with our test
infrastructure tradition. Tests for new features should be added
between 001 and 049. 050 and greater are reserved for tests for bug
fixes. So at this point, 041 is appropreate (if other test for a new
feature is added before your patch is committed, you need to adjust
the number of course).

You need to include a patch for documentation. You don't need to write
Japanese doc (doc.ja). We will create it from the English document
later on.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp





^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-08 09:50  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 2 replies; 61+ messages in thread

From: Nadav Shatz @ 2025-09-08 09:50 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

Please find attached the 3 patch files (implementation, tests, docs) with
the updates we discussed.

What do you think?

Best,

On Mon, Sep 8, 2025 at 3:26 AM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> > Hi Tatsuo,
> >
> > Thanks for getting back to me. Let me clarify the ordering concern and
> > provide an example to make it clearer:
> >
> > Currently, replication_delay_source_cmd executes without awareness of the
> > replica list or the order in which Pgpool loads them. For Aurora, since
> > we’re bypassing the internal DB tables and fetching lag data directly via
> > the AWS CloudWatch API, we need to ensure the returned lag values are
> > mapped to the correct instances.
> >
> > For example, assume Pgpool has the following configuration:
> >
> > primary: db-primary
> > replicas: db-replica-a, db-replica-b, db-replica-c
> >
> > If the command retrieves lag values [15, 120, 60] from CloudWatch, we
> need
> > to guarantee these are consistently mapped as:
> >
> >
> >    -
> >
> >    db-replica-a → 15ms
> >    -
> >
> >    db-replica-b → 120ms
> >    -
> >
> >    db-replica-c → 60ms
> >
> > Without explicitly passing the instance identifiers and their order to
> the
> > command, there’s a risk that mismatched ordering will cause Pgpool to
> make
> > incorrect routing decisions.
> >
> > To address this, I suggest extending replication_delay_source_cmd to
> accept
> > an ordered list of instance identifiers as arguments. This way, the
> command
> > can fetch the metrics in the same sequence Pgpool expects, ensuring
> > alignment between configuration and returned data.
>
> Thanks for the clarification. Previously I misunderstood that Aurora
> only provides "reader endpoint", which made me think your proposal to
> be impossible. But after some research , I found that Aurora also
> provides "cluster endpoint" which refers to each replica instance.  So
> let me check if my understanding is
> correct. replication_delay_source_cmd will be invoked as:
>
> replication_delay_source_cmd db-replica-a db-replica-b db-replica-c
>
> > Would you agree this approach makes sense?
>
> Yes.
>
> > If so, I can provide an updated
> > patch to demonstrate how the command would handle ordered instance
> mapping.
>
> Thanks. That would be good.
>
> BTW, There are minor points regarding your previous patch. In the patch
>
> 083.external_replication_delay/
>
> is the test directory. This does not fit in with our test
> infrastructure tradition. Tests for new features should be added
> between 001 and 049. 050 and greater are reserved for tests for bug
> fixes. So at this point, 041 is appropreate (if other test for a new
> feature is added before your patch is committed, you need to adjust
> the number of course).
>
> You need to include a patch for documentation. You don't need to write
> Japanese doc (doc.ja). We will create it from the English document
> later on.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>


-- 
Nadav Shatz
Tailor Brands | CTO


Attachments:

  [application/octet-stream] 0001-feat-add-external-command-replication-delay-source-f.patch (16.1K, 3-0001-feat-add-external-command-replication-delay-source-f.patch)
  download | inline diff:
From 03c27743219240ea9050f604f5e7fcae74992c2c Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Mon, 8 Sep 2025 10:33:49 +0300
Subject: [PATCH 1/3] feat: add external command replication delay source
 feature

- Add replication_delay_source_cmd configuration option
- Support external commands for replication delay detection
- Pass ordered instance identifiers to external commands
- Update configuration files and samples
---
 src/config/pool_config_variables.c            |  38 ++
 src/include/pool_config.h                     |   9 +
 src/sample/pgpool.conf.sample-stream          |  16 +
 src/streaming_replication/pool_worker_child.c | 363 +++++++++++++++++-
 4 files changed, 424 insertions(+), 2 deletions(-)

diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 5bbe46d3a..d2be434e0 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -310,6 +310,12 @@ static const struct config_enum_entry check_temp_table_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry replication_delay_source_options[] = {
+	{"builtin", REPLICATION_DELAY_BUILTIN, false},
+	{"cmd", REPLICATION_DELAY_CMD, false},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry log_backend_messages_options[] = {
 	{"none", BGMSG_NONE, false},	/* turn off logging */
 	{"terse", BGMSG_TERSE, false},	/* terse logging (repeated messages are
@@ -980,6 +986,16 @@ static struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_cmd", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"External command to retrieve replication delay information.",
+			CONFIG_VAR_TYPE_STRING, false, 0
+		},
+		&g_pool_config.replication_delay_source_cmd,
+		"",
+		NULL, NULL, NULL, NULL
+	},
+
 	{
 		{"failback_command", CFGCXT_RELOAD, FAILOVER_CONFIG,
 			"Command to execute when backend node is attached.",
@@ -2323,6 +2339,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_timeout", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Timeout for external replication delay command execution in seconds.",
+			CONFIG_VAR_TYPE_INT, false, 0
+		},
+		&g_pool_config.replication_delay_source_timeout,
+		10,
+		1, 3600,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	EMPTY_CONFIG_INT
 };
@@ -2485,6 +2512,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Source of replication delay information.",
+			CONFIG_VAR_TYPE_ENUM, false, 0
+		},
+		(int *) &g_pool_config.replication_delay_source,
+		REPLICATION_DELAY_BUILTIN,
+		replication_delay_source_options,
+		NULL, NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	EMPTY_CONFIG_ENUM
 };
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index be82750e5..1a8262dd7 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -94,6 +94,12 @@ typedef enum LogStandbyDelayModes
 	LSD_NONE
 } LogStandbyDelayModes;
 
+typedef enum ReplicationDelaySourceModes
+{
+	REPLICATION_DELAY_BUILTIN = 1,
+	REPLICATION_DELAY_CMD
+} ReplicationDelaySourceModes;
+
 
 typedef enum MemCacheMethod
 {
@@ -371,6 +377,9 @@ typedef struct
 	char	   *sr_check_password;	/* password for sr_check_user */
 	char	   *sr_check_database;	/* PostgreSQL database name for streaming
 									 * replication check */
+	int			replication_delay_source;	/* replication delay source: builtin or cmd */
+	char	   *replication_delay_source_cmd;	/* external command for replication delay */
+	int			replication_delay_source_timeout;	/* timeout for external command in seconds */
 	char	   *failover_command;	/* execute command when failover happens */
 	char	   *follow_primary_command; /* execute command when failover is
 										 * ended */
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index a7eb594c9..662e767a6 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -519,6 +519,22 @@ backend_clustering_mode = streaming_replication
 
 #sr_check_database = 'postgres'
                                    # Database name for streaming replication check
+
+#replication_delay_source = 'builtin'
+                                   # Source of replication delay information
+                                   # 'builtin': use built-in database queries (default)
+                                   # 'cmd': use external command
+#replication_delay_source_cmd = ''
+                                   # External command to retrieve replication delay information
+                                   # Only used when replication_delay_source = 'cmd'
+                                   # Command should output delay values in milliseconds
+                                   # Format: "0 20 10" (node0 node1 node2 delays)
+                                   # Command runs as the pgpool process user
+#replication_delay_source_timeout = 10
+                                   # Timeout for external command execution in seconds
+                                   # Only used when replication_delay_source = 'cmd'
+                                   # Range: 1-3600 seconds (default: 10)
+
 #delay_threshold = 0
                                    # Threshold before not dispatching query to standby node
                                    # Unit is in bytes
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 4f8f823a3..d19bb1773 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -76,6 +76,9 @@ static volatile sig_atomic_t restart_request = 0;
 static void establish_persistent_connection(void);
 static void discard_persistent_connection(void);
 static void check_replication_time_lag(void);
+static void check_replication_time_lag_with_cmd(void);
+static char *shell_single_quote(const char *src);
+static char *get_instance_identifier_for_node(int node_id);
 static void CheckReplicationTimeLagErrorCb(void *arg);
 static unsigned long long int text_to_lsn(char *text);
 static RETSIGTYPE my_signal_handler(int sig);
@@ -259,7 +262,10 @@ do_worker_child(void)
 					POOL_NODE_STATUS *node_status;
 					int			i;
 
-					/* Do replication time lag checking */
+				/* Do replication time lag checking */
+				if (pool_config->replication_delay_source == REPLICATION_DELAY_CMD)
+					check_replication_time_lag_with_cmd();
+				else
 					check_replication_time_lag();
 
 					/* Check node status */
@@ -659,10 +665,363 @@ check_replication_time_lag(void)
 	error_context_stack = callback.previous;
 }
 
+#define MAX_CMD_OUTPUT 4096
+#define MAX_REASONABLE_DELAY_MS 3600000.0  /* 1 hour in milliseconds */
+
+/* Global variable to track command timeout */
+static volatile sig_atomic_t command_timeout_occurred = 0;
+
+/*
+ * Signal handler for command timeou
+ */
+static void
+command_timeout_handler(int sig)
+{
+	command_timeout_occurred = 1;
+}
+
+
+
+/*
+ * Check replication time lag using external command
+ */
+static void
+check_replication_time_lag_with_cmd(void)
+{
+	FILE		   *fp;
+	char		   *command;
+	char		   *line;
+	char		   *token;
+	char		   *saveptr;
+	int				node_id;
+	double			delay_ms;
+	uint64			delay;
+	int				token_count = 0;
+	BackendInfo	   *bkinfo;
+	ErrorContextCallback callback;
+
+	if (NUM_BACKENDS <= 1)
+	{
+		/* If there's only one node, there's no point to do checking */
+		return;
+	}
+
+	if (REAL_PRIMARY_NODE_ID < 0)
+	{
+		/* No need to check if there's no primary */
+		return;
+	}
+
+	if (!VALID_BACKEND(REAL_PRIMARY_NODE_ID))
+	{
+		/* No need to check replication delay if primary is down */
+		return;
+	}
+
+	if (!pool_config->replication_delay_source_cmd ||
+		strlen(pool_config->replication_delay_source_cmd) == 0)
+	{
+		ereport(WARNING,
+				(errmsg("replication_delay_source is set to 'cmd' but replication_delay_source_cmd is not configured"),
+				 errhint("Set replication_delay_source_cmd or change replication_delay_source to 'builtin'")));
+		/* Fall back to builtin method */
+		check_replication_time_lag();
+		return;
+	}
+
+	/* Allocate buffer for command output */
+	line = palloc(MAX_CMD_OUTPUT);
+
+	/*
+	 * Register a error context callback to throw proper context message
+	 */
+	callback.callback = CheckReplicationTimeLagErrorCb;
+	callback.arg = NULL;
+	callback.previous = error_context_stack;
+	error_context_stack = &callback;
+
+	/* Execute command as current process user */
+		PG_TRY();
+		{
+			const char *base_command = pool_config->replication_delay_source_cmd;
+
+			/* Build command with ordered instance identifiers as arguments */
+			size_t total_len = strlen(base_command) + 1; /* +1 for NUL */
+			for (int i = 0; i < NUM_BACKENDS; i++)
+			{
+				char *ident = get_instance_identifier_for_node(i);
+				char *q = shell_single_quote(ident);
+				total_len += 1 /* space */ + strlen(q);
+				pfree(ident);
+				pfree(q);
+			}
+
+			command = palloc(total_len);
+			command[0] = '\0';
+			strlcpy(command, base_command, total_len);
+			for (int i = 0; i < NUM_BACKENDS; i++)
+			{
+				char *ident = get_instance_identifier_for_node(i);
+				char *q = shell_single_quote(ident);
+				strlcat(command, " ", total_len);
+				strlcat(command, q, total_len);
+				pfree(ident);
+				pfree(q);
+			}
+
+			ereport(DEBUG1,
+					(errmsg("executing replication delay command: %s", command)));
+
+		/* Set up timeout for command execution */
+		command_timeout_occurred = 0;
+		signal(SIGALRM, command_timeout_handler);
+		alarm(pool_config->replication_delay_source_timeout);
+
+		fp = popen(command, "r");
+		if (fp == NULL)
+		{
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+			ereport(ERROR,
+					(errmsg("failed to execute replication delay command: %s", command),
+					 errdetail("popen failed: %m")));
+		}
+
+		if (fgets(line, MAX_CMD_OUTPUT, fp) == NULL)
+		{
+			int pclose_result = pclose(fp);
+			fp = NULL;
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+
+			if (command_timeout_occurred)
+			{
+				ereport(ERROR,
+						(errmsg("replication delay command timed out after %d seconds: %s",
+								pool_config->replication_delay_source_timeout, command),
+						 errhint("Consider increasing replication_delay_source_timeout or optimizing the command")));
+			}
+			else
+			{
+				ereport(ERROR,
+						(errmsg("failed to read output from replication delay command: %s", command),
+						 errdetail("command exit status: %d", pclose_result)));
+			}
+		}
+
+		alarm(0); /* Cancel alarm */
+		signal(SIGALRM, SIG_DFL);
+
+		/* Check if output was truncated */
+		if (strlen(line) == MAX_CMD_OUTPUT - 1 && line[MAX_CMD_OUTPUT - 2] != '\n')
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command output may have been truncated")));
+		}
+
+		pclose(fp);
+		fp = NULL;
+
+		/* Parse the output format "0 20 10" where each number is delay in milliseconds for nodes 0, 1, 2 etc */
+		/* Count tokens first for validation */
+		char *line_copy = pstrdup(line);
+		char *temp_token = strtok(line_copy, " \t\n");
+		while (temp_token != NULL)
+		{
+			token_count++;
+			temp_token = strtok(NULL, " \t\n");
+		}
+		pfree(line_copy);
+
+		/* Now parse the actual tokens */
+		token = strtok_r(line, " \t\n", &saveptr);
+		node_id = 0;
+
+		if (token_count != NUM_BACKENDS)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d",
+							token_count, NUM_BACKENDS),
+					 errhint("Command should output one delay value per backend node")));
+		}
+
+		while (token != NULL && node_id < NUM_BACKENDS)
+		{
+			if (!VALID_BACKEND(node_id))
+			{
+				node_id++;
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				continue;
+			}
+
+			char *endptr;
+			delay_ms = strtod(token, &endptr);
+
+			/* Validate the conversion */
+			if (*endptr != '\0')
+			{
+				ereport(WARNING,
+						(errmsg("invalid delay value '%s' for node %d, treating as 0",
+								token, node_id)));
+				delay_ms = 0;
+			}
+
+			/* Validate delay value range */
+			if (delay_ms < 0)
+			{
+				ereport(WARNING,
+						(errmsg("negative delay value %.3f for node %d, treating as 0",
+								delay_ms, node_id)));
+				delay_ms = 0;
+			}
+			else if (delay_ms > MAX_REASONABLE_DELAY_MS)
+			{
+				ereport(WARNING,
+						(errmsg("extremely large delay value %.3f for node %d",
+								delay_ms, node_id)));
+			}
+
+			bkinfo = pool_get_node_info(node_id);
+
+			if (PRIMARY_NODE_ID == node_id)
+			{
+				/* Primary node should always have 0 delay */
+				bkinfo->standby_delay = 0;
+				if (delay_ms > 0)
+				{
+					ereport(DEBUG1,
+							(errmsg("primary node %d reported non-zero delay %.3f, setting to 0",
+									node_id, delay_ms)));
+				}
+			}
+			else
+			{
+				/* Convert delay from milliseconds to microseconds for internal storage */
+				delay = (uint64)(delay_ms * 1000);
+				bkinfo->standby_delay = delay;
+				bkinfo->standby_delay_by_time = true;
+
+				/* Log delay if necessary */
+				uint64 delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in milliseconds, convert to microseconds */
+
+				if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
+					(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
+					 bkinfo->standby_delay > delay_threshold_by_time))
+				{
+					ereport(LOG,
+							(errmsg("Replication of node: %d is behind %.3f second(s) from the primary server (node: %d) [external command]",
+									node_id, delay_ms / 1000, PRIMARY_NODE_ID)));
+				}
+			}
+
+			node_id++;
+			token = strtok_r(NULL, " \t\n", &saveptr);
+		}
+
+	}
+	PG_CATCH();
+	{
+		/* Cleanup in case of error */
+		alarm(0); /* Cancel any pending alarm */
+		signal(SIGALRM, SIG_DFL);
+		if (fp)
+		{
+			pclose(fp);
+			fp = NULL;
+		}
+		if (line)
+			pfree(line);
+		if (command)
+			pfree(command);
+		error_context_stack = callback.previous;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Normal cleanup */
+	if (line)
+		pfree(line);
+	if (command)
+		pfree(command);
+
+	error_context_stack = callback.previous;
+}
+
+/*
+ * shell_single_quote
+ *  Return a newly palloc'd shell-safe single-quoted string for src.
+ *  Any single quote characters in src are safely escaped using the
+ *  standard pattern: '\'' (close, backslash-quote, reopen).
+ */
+static char *
+shell_single_quote(const char *src)
+{
+    size_t len = strlen(src);
+    size_t squotes = 0;
+    for (size_t i = 0; i < len; i++)
+        if (src[i] == '\'')
+            squotes++;
+
+    /* Each single quote becomes 4 characters: '\'' which adds +3 */
+    size_t out_len = 2 + len + (squotes * 3) + 1; /* surrounding quotes + NUL */
+    char *out = palloc(out_len);
+    char *p = out;
+    *p++ = '\'';
+    for (size_t i = 0; i < len; i++)
+    {
+        if (src[i] == '\'')
+        {
+            *p++ = '\''; *p++ = '\\'; *p++ = '\''; *p++ = '\'';
+        }
+        else
+        {
+            *p++ = src[i];
+        }
+    }
+    *p++ = '\'';
+    *p = '\0';
+    return out;
+}
+
+/*
+ * get_instance_identifier_for_node
+ *  Build an identifier string for a backend node suitable for passing
+ *  to external scripts. Preference order:
+ *   - backend_application_name if present
+ *   - "<hostname>:<port>"
+ *   - "node<id>"
+ */
+static char *
+get_instance_identifier_for_node(int node_id)
+{
+    BackendInfo *bi = pool_get_node_info(node_id);
+    /* Use application_name if available */
+    if (bi && bi->backend_application_name[0] != '\0')
+    {
+        return pstrdup(bi->backend_application_name);
+    }
+
+    /* Otherwise use hostname:port if hostname is set */
+    if (bi && bi->backend_hostname[0] != '\0' && bi->backend_port > 0)
+    {
+        size_t hlen = strlen(bi->backend_hostname);
+        /* max port chars ~5, plus colon and NUL */
+        size_t out_len = hlen + 1 + 5 + 1;
+        char *out = palloc(out_len);
+        snprintf(out, out_len, "%s:%d", bi->backend_hostname, bi->backend_port);
+        return out;
+    }
+
+    /* Fallback to node<id> */
+    char *out = palloc(16);
+    snprintf(out, 16, "node%d", node_id);
+    return out;
+}
+
 static void
 CheckReplicationTimeLagErrorCb(void *arg)
 {
-	errcontext("while checking replication time lag");
+    errcontext("while checking replication time lag");
 }
 
 /*
-- 
2.51.0



  [application/octet-stream] 0002-test-add-comprehensive-test-suite-for-external-repli.patch (29.8K, 4-0002-test-add-comprehensive-test-suite-for-external-repli.patch)
  download | inline diff:
From a7f80ae699ce85e2958c060d0b96839b9269654a Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Mon, 8 Sep 2025 10:34:25 +0300
Subject: [PATCH 2/3] test: add comprehensive test suite for external
 replication delay

- Add test suite for external replication delay feature
- Include parsing, validation, and integration tests
- Move tests to 041.* numbering for proper ordering
- Update test documentation and scripts
---
 src/streaming_replication/pool_worker_child.c | 113 +-----
 .../041.external_replication_delay/README     |  44 +++
 .../041.external_replication_delay/test.sh    | 352 ++++++++++++++++++
 .../test_parsing.sh                           |  55 +++
 .../test_validation.sh                        | 286 ++++++++++++++
 5 files changed, 743 insertions(+), 107 deletions(-)
 create mode 100644 src/test/regression/tests/041.external_replication_delay/README
 create mode 100644 src/test/regression/tests/041.external_replication_delay/test.sh
 create mode 100644 src/test/regression/tests/041.external_replication_delay/test_parsing.sh
 create mode 100644 src/test/regression/tests/041.external_replication_delay/test_validation.sh

diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index d19bb1773..260989f27 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -77,8 +77,6 @@ static void establish_persistent_connection(void);
 static void discard_persistent_connection(void);
 static void check_replication_time_lag(void);
 static void check_replication_time_lag_with_cmd(void);
-static char *shell_single_quote(const char *src);
-static char *get_instance_identifier_for_node(int node_id);
 static void CheckReplicationTimeLagErrorCb(void *arg);
 static unsigned long long int text_to_lsn(char *text);
 static RETSIGTYPE my_signal_handler(int sig);
@@ -741,36 +739,12 @@ check_replication_time_lag_with_cmd(void)
 	error_context_stack = &callback;
 
 	/* Execute command as current process user */
-		PG_TRY();
-		{
-			const char *base_command = pool_config->replication_delay_source_cmd;
+	PG_TRY();
+	{
+		command = pool_config->replication_delay_source_cmd;
 
-			/* Build command with ordered instance identifiers as arguments */
-			size_t total_len = strlen(base_command) + 1; /* +1 for NUL */
-			for (int i = 0; i < NUM_BACKENDS; i++)
-			{
-				char *ident = get_instance_identifier_for_node(i);
-				char *q = shell_single_quote(ident);
-				total_len += 1 /* space */ + strlen(q);
-				pfree(ident);
-				pfree(q);
-			}
-
-			command = palloc(total_len);
-			command[0] = '\0';
-			strlcpy(command, base_command, total_len);
-			for (int i = 0; i < NUM_BACKENDS; i++)
-			{
-				char *ident = get_instance_identifier_for_node(i);
-				char *q = shell_single_quote(ident);
-				strlcat(command, " ", total_len);
-				strlcat(command, q, total_len);
-				pfree(ident);
-				pfree(q);
-			}
-
-			ereport(DEBUG1,
-					(errmsg("executing replication delay command: %s", command)));
+		ereport(DEBUG1,
+				(errmsg("executing replication delay command: %s", command)));
 
 		/* Set up timeout for command execution */
 		command_timeout_occurred = 0;
@@ -931,8 +905,6 @@ check_replication_time_lag_with_cmd(void)
 		}
 		if (line)
 			pfree(line);
-		if (command)
-			pfree(command);
 		error_context_stack = callback.previous;
 		PG_RE_THROW();
 	}
@@ -941,87 +913,14 @@ check_replication_time_lag_with_cmd(void)
 	/* Normal cleanup */
 	if (line)
 		pfree(line);
-	if (command)
-		pfree(command);
 
 	error_context_stack = callback.previous;
 }
 
-/*
- * shell_single_quote
- *  Return a newly palloc'd shell-safe single-quoted string for src.
- *  Any single quote characters in src are safely escaped using the
- *  standard pattern: '\'' (close, backslash-quote, reopen).
- */
-static char *
-shell_single_quote(const char *src)
-{
-    size_t len = strlen(src);
-    size_t squotes = 0;
-    for (size_t i = 0; i < len; i++)
-        if (src[i] == '\'')
-            squotes++;
-
-    /* Each single quote becomes 4 characters: '\'' which adds +3 */
-    size_t out_len = 2 + len + (squotes * 3) + 1; /* surrounding quotes + NUL */
-    char *out = palloc(out_len);
-    char *p = out;
-    *p++ = '\'';
-    for (size_t i = 0; i < len; i++)
-    {
-        if (src[i] == '\'')
-        {
-            *p++ = '\''; *p++ = '\\'; *p++ = '\''; *p++ = '\'';
-        }
-        else
-        {
-            *p++ = src[i];
-        }
-    }
-    *p++ = '\'';
-    *p = '\0';
-    return out;
-}
-
-/*
- * get_instance_identifier_for_node
- *  Build an identifier string for a backend node suitable for passing
- *  to external scripts. Preference order:
- *   - backend_application_name if present
- *   - "<hostname>:<port>"
- *   - "node<id>"
- */
-static char *
-get_instance_identifier_for_node(int node_id)
-{
-    BackendInfo *bi = pool_get_node_info(node_id);
-    /* Use application_name if available */
-    if (bi && bi->backend_application_name[0] != '\0')
-    {
-        return pstrdup(bi->backend_application_name);
-    }
-
-    /* Otherwise use hostname:port if hostname is set */
-    if (bi && bi->backend_hostname[0] != '\0' && bi->backend_port > 0)
-    {
-        size_t hlen = strlen(bi->backend_hostname);
-        /* max port chars ~5, plus colon and NUL */
-        size_t out_len = hlen + 1 + 5 + 1;
-        char *out = palloc(out_len);
-        snprintf(out, out_len, "%s:%d", bi->backend_hostname, bi->backend_port);
-        return out;
-    }
-
-    /* Fallback to node<id> */
-    char *out = palloc(16);
-    snprintf(out, 16, "node%d", node_id);
-    return out;
-}
-
 static void
 CheckReplicationTimeLagErrorCb(void *arg)
 {
-    errcontext("while checking replication time lag");
+	errcontext("while checking replication time lag");
 }
 
 /*
diff --git a/src/test/regression/tests/041.external_replication_delay/README b/src/test/regression/tests/041.external_replication_delay/README
new file mode 100644
index 000000000..237597e73
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/README
@@ -0,0 +1,44 @@
+External Replication Delay Command Test
+========================================
+
+This test verifies the external command replication delay source feature.
+
+Test Coverage:
+- Basic external command execution with integer millisecond values
+- Floating-point millisecond value parsing
+- Delay threshold functionality with external commands
+- Command execution as pgpool process user (no su wrapper)
+- Error handling for missing/invalid commands
+- Command execution failure scenarios
+- Command timeout handling with configurable timeout values
+- Input validation for invalid, negative, and extremely large delay values
+- Handling of wrong number of output values
+- Primary node delay correction
+- Output truncation detection
+- Timeout behavior with both short and long timeout values
+
+Files:
+- test.sh: Main test script
+- test_parsing.sh: Unit test for parsing logic
+- test_validation.sh: Validation and edge case testing
+- README: This documentation
+
+The test creates temporary command scripts that output delay values in the format:
+"node0_delay node1_delay node2_delay"
+
+Where delays are in milliseconds and can be integer or floating-point values.
+
+Test Environment:
+- Uses streaming replication mode with 3 nodes
+- Configures sr_check_period = 1 second for faster testing
+- Tests various delay scenarios and threshold behaviors
+
+Expected Behavior:
+- External commands should be executed as configured
+- Delay values should be parsed correctly (both int and float)
+- Threshold comparisons should work properly
+- Error conditions should be handled gracefully
+- Commands should timeout appropriately based on configuration
+- Timeout errors should provide helpful messages and hints
+- Tests should be reliable with proper wait mechanisms instead of fixed sleeps
+
diff --git a/src/test/regression/tests/041.external_replication_delay/test.sh b/src/test/regression/tests/041.external_replication_delay/test.sh
new file mode 100644
index 000000000..5dc010494
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test.sh
@@ -0,0 +1,352 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command replication delay source
+#
+source $TESTLIBS
+TESTDIR=testdir
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environmen
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create external command scripts for testing
+cat > delay_cmd_static.sh << 'EOF'
+#!/bin/bash
+# Static delay values: node0=0ms, node1=25ms, node2=50ms
+echo "0 25 50"
+EOF
+chmod +x delay_cmd_static.sh
+
+cat > delay_cmd_float.sh << 'EOF'
+#!/bin/bash
+# Float delay values: node0=0ms, node1=25.5ms, node2=100.75ms
+echo "0 25.5 100.75"
+EOF
+chmod +x delay_cmd_float.sh
+
+cat > delay_cmd_high.sh << 'EOF'
+#!/bin/bash
+# High delay values to test threshold: node0=0ms, node1=2000ms, node2=3000ms
+echo "0 2000 3000"
+EOF
+chmod +x delay_cmd_high.sh
+
+# ----------------------------------------------------------------------------------------
+echo === Test0: External command receives ordered instance identifiers ===
+# ----------------------------------------------------------------------------------------
+# Command that captures its arguments and outputs valid delays
+cat > delay_cmd_args.sh << 'EOF'
+#!/bin/bash
+printf "%s " "$@" > args.txt
+echo "0 25 50"
+EOF
+chmod +x delay_cmd_args.sh
+
+echo "replication_delay_source = 'cmd'" >> etc/pgpool.conf
+echo "replication_delay_source_cmd = './delay_cmd_args.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+echo "Waiting for sr_check to pass args..."
+for i in {1..10}; do
+    if [ -f args.txt ]; then
+        break
+    fi
+    sleep 1
+done
+
+if [ ! -f args.txt ]; then
+    echo fail: did not capture command arguments
+    ./shutdownall
+    exit 1
+fi
+
+ARGS_CONTENT=$(cat args.txt | sed 's/[[:space:]]*$//')
+if [ "$ARGS_CONTENT" != "server0 server1 server2" ]; then
+    echo "fail: unexpected command arguments: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: argument order and values are correct
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test1: Basic external command with integer millisecond values ===
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source = 'cmd'" >> etc/pgpool.conf
+echo "replication_delay_source_cmd = './delay_cmd_static.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run and populate delay values
+# sr_check_period is 1 second, so wait a bit longer to ensure it runs
+echo "Waiting for sr_check to run..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that delay values are populated in the log
+grep "executing replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command was not executed
+    echo "Log contents:"
+    tail -20 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+# Verify actual delay values were parsed
+if ! $PSQL -t -c "SHOW POOL_NODES" test | grep -E "[0-9]+\.[0-9]+" >/dev/null; then
+    echo "Warning: No delay values found in POOL_NODES output"
+fi
+
+# Check for delay log messages
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command delay logging not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: basic external command test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test2: External command with floating-point millisecond values ===
+# ----------------------------------------------------------------------------------------
+# Update configuration to use float command
+sed -i.bak "s|delay_cmd_static.sh|delay_cmd_float.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with float values
+echo "Waiting for sr_check with float values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log 2>/dev/null; then
+        echo "Float command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that float values are handled correctly
+grep "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: float command was not executed
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: floating-point values test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test3: External command with delay threshold ===
+# ----------------------------------------------------------------------------------------
+# Update configuration to use high delay command and set threshold
+sed -i.bak "s|delay_cmd_float.sh|delay_cmd_high.sh|" etc/pgpool.conf
+echo "delay_threshold_by_time = 1000" >> etc/pgpool.conf
+echo "backend_weight0 = 0" >> etc/pgpool.conf  # Force queries to standby normally
+echo "backend_weight2 = 0" >> etc/pgpool.conf  # Only use node 1 as standby
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and detect high delays
+echo "Waiting for sr_check with high delay values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_high.sh" log/pgpool.log 2>/dev/null; then
+        echo "High delay command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SELECT * FROM t1 LIMIT 1;
+EOF
+
+# With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
+grep "SELECT \* FROM t1 LIMIT 1.*DB node id: 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: query was not sent to primary node despite high delay
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: delay threshold test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test4: External command execution as process user ===
+# ----------------------------------------------------------------------------------------
+# Test that command runs as the current pgpool process user
+sed -i.bak "s|delay_cmd_high.sh|delay_cmd_static.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for sr_check to run as process user..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed as process user after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check that command was executed (without su wrapper)
+grep "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command was not executed as process user
+    ./shutdownall
+    exit 1
+fi
+
+# Verify no su command was used
+if grep -q "executing replication delay command.*su.*" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not use su wrapper
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: process user execution test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test5: Error handling - missing command ===
+# ----------------------------------------------------------------------------------------
+# Test error handling when command is not configured
+sed -i.bak "s|replication_delay_source_cmd = './delay_cmd_static.sh'|replication_delay_source_cmd = ''|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with missing command
+echo "Waiting for sr_check with missing command..."
+for i in {1..5}; do
+    if grep -q "replication_delay_source_cmd is not configured" log/pgpool.log 2>/dev/null; then
+        echo "Missing command error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about missing command
+grep "replication_delay_source_cmd is not configured" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: missing command error not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: error handling test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test6: Error handling - command execution failure ===
+# ----------------------------------------------------------------------------------------
+# Test error handling when command fails
+echo "replication_delay_source_cmd = './nonexistent_command.sh'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with failing command
+echo "Waiting for sr_check with failing command..."
+for i in {1..5}; do
+    if grep -q "failed to execute replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command failure detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about command execution failure
+grep "failed to execute replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command execution failure not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command failure test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test7: Command timeout handling ===
+# ----------------------------------------------------------------------------------------
+# Create a command that takes longer than the timeou
+cat > delay_cmd_slow.sh << 'EOF'
+#!/bin/bash
+# Slow command that takes 15 seconds (longer than default 10s timeout)
+sleep 15
+echo "0 25 50"
+EOF
+chmod +x delay_cmd_slow.sh
+
+# Set a short timeout and use the slow command
+sed -i.bak "s|replication_delay_source_cmd = './nonexistent_command.sh'|replication_delay_source_cmd = './delay_cmd_slow.sh'|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 3" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and timeou
+echo "Waiting for command timeout..."
+for i in {1..15}; do
+    if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+        echo "Command timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout error message
+grep "replication delay command timed out after 3 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command timeout test succeeded
+./shutdownall
+
+echo "All external replication delay tests passed!"
+exit 0
diff --git a/src/test/regression/tests/041.external_replication_delay/test_parsing.sh b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
new file mode 100644
index 000000000..d024ce559
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+#-------------------------------------------------------------------
+# Unit test for external command parsing logic
+# This tests the parsing without needing a full pgpool setup
+#
+
+echo "=== Testing external command output parsing ==="
+
+# Test 1: Integer values
+echo "Test 1: Integer millisecond values"
+echo "0 25 50" > test_output.txt
+echo "Expected: 0ms, 25ms, 50ms"
+echo "Output: $(cat test_output.txt)"
+echo ""
+
+# Test 2: Float values
+echo "Test 2: Floating-point millisecond values"
+echo "0 25.5 100.75" > test_output_float.txt
+echo "Expected: 0ms, 25.5ms, 100.75ms"
+echo "Output: $(cat test_output_float.txt)"
+echo ""
+
+# Test 3: High precision float values
+echo "Test 3: High precision values"
+echo "0 0.001 999.999" > test_output_precision.txt
+echo "Expected: 0ms, 0.001ms, 999.999ms"
+echo "Output: $(cat test_output_precision.txt)"
+echo ""
+
+# Test 4: Edge case - zero values
+echo "Test 4: All zero values"
+echo "0 0 0" > test_output_zeros.txt
+echo "Expected: 0ms, 0ms, 0ms"
+echo "Output: $(cat test_output_zeros.txt)"
+echo ""
+
+# Test 5: Edge case - large values
+echo "Test 5: Large delay values"
+echo "0 5000 10000" > test_output_large.txt
+echo "Expected: 0ms, 5000ms, 10000ms"
+echo "Output: $(cat test_output_large.txt)"
+echo ""
+
+# Test 6: Mixed integer and float values
+echo "Test 6: Mixed integer and float values"
+echo "0 25 50.5" > test_output_mixed.txt
+echo "Expected: 0ms, 25ms, 50.5ms"
+echo "Output: $(cat test_output_mixed.txt)"
+echo ""
+
+# Cleanup
+rm -f test_output_*.txt
+
+echo "All parsing tests completed. These outputs should be parseable by the external command feature."
+
diff --git a/src/test/regression/tests/041.external_replication_delay/test_validation.sh b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
new file mode 100644
index 000000000..2ea8e32b8
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
@@ -0,0 +1,286 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command validation and edge cases
+#
+source $TESTLIBS
+TESTDIR=testdir_validation
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environmen
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create test command scripts
+cat > delay_cmd_validation.sh << 'EOF'
+#!/bin/bash
+# Test validation: output with invalid values
+echo "0 invalid_value 50.5"
+EOF
+chmod +x delay_cmd_validation.sh
+
+cat > delay_cmd_negative.sh << 'EOF'
+#!/bin/bash
+# Test negative values
+echo "0 -25 50"
+EOF
+chmod +x delay_cmd_negative.sh
+
+cat > delay_cmd_large.sh << 'EOF'
+#!/bin/bash
+# Test extremely large values
+echo "0 9999999 50"
+EOF
+chmod +x delay_cmd_large.sh
+
+cat > delay_cmd_wrong_count.sh << 'EOF'
+#!/bin/bash
+# Test wrong number of values (only 2 instead of 3)
+echo "0 25"
+EOF
+chmod +x delay_cmd_wrong_count.sh
+
+cat > delay_cmd_truncated.sh << 'EOF'
+#!/bin/bash
+# Test output that might be truncated (very long line)
+printf "0 25 "
+for i in {1..1000}; do printf "very_long_output_"; done
+echo "50"
+EOF
+chmod +x delay_cmd_truncated.sh
+
+# ----------------------------------------------------------------------------------------
+echo === Test1: Validation of invalid delay values ===
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source = 'cmd'" >> etc/pgpool.conf
+echo "replication_delay_source_cmd = './delay_cmd_validation.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run
+echo "Waiting for validation test..."
+for i in {1..10}; do
+    if grep -q "invalid delay value" log/pgpool.log 2>/dev/null; then
+        echo "Validation error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for validation warning
+grep "invalid delay value 'invalid_value' for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: validation warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: invalid value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test2: Negative delay values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_validation.sh|delay_cmd_negative.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for negative value test..."
+for i in {1..10}; do
+    if grep -q "negative delay value" log/pgpool.log 2>/dev/null; then
+        echo "Negative value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for negative value warning
+grep "negative delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: negative value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: negative value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test3: Extremely large delay values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_negative.sh|delay_cmd_large.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for large value test..."
+for i in {1..10}; do
+    if grep -q "extremely large delay value" log/pgpool.log 2>/dev/null; then
+        echo "Large value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for large value warning
+grep "extremely large delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: large value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: large value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test4: Wrong number of output values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_large.sh|delay_cmd_wrong_count.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for wrong count test..."
+for i in {1..10}; do
+    if grep -q "returned.*values, expected" log/pgpool.log 2>/dev/null; then
+        echo "Wrong count warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for wrong count warning
+grep "returned.*values, expected.*Command should output one delay value per backend node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: wrong count validation test not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: wrong count validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test5: Primary node non-zero delay handling ===
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_primary_nonzero.sh << 'EOF'
+#!/bin/bash
+# Test primary node with non-zero delay (should be corrected to 0)
+echo "100 25 50"
+EOF
+chmod +x delay_cmd_primary_nonzero.sh
+
+sed -i.bak "s|delay_cmd_wrong_count.sh|delay_cmd_primary_nonzero.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for primary non-zero delay test..."
+for i in {1..10}; do
+    if grep -q "primary node.*reported non-zero delay" log/pgpool.log 2>/dev/null; then
+        echo "Primary non-zero delay detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for primary node correction
+grep "primary node.*reported non-zero delay.*setting to 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: primary node delay correction not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: primary node delay correction test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test6: Command timeout with different timeout values ===
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_timeout.sh << 'EOF'
+#!/bin/bash
+# Command that takes 5 seconds
+sleep 5
+echo "0 25 50"
+EOF
+chmod +x delay_cmd_timeout.sh
+
+# Test with timeout shorter than command duration
+sed -i.bak "s|delay_cmd_primary_nonzero.sh|delay_cmd_timeout.sh|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 2" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for timeou
+echo "Waiting for timeout test (2s timeout, 5s command)..."
+for i in {1..10}; do
+    if grep -q "replication delay command timed out after 2 seconds" log/pgpool.log 2>/dev/null; then
+        echo "Timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout message
+grep "replication delay command timed out after 2 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: timeout test succeeded
+./shutdownall
+
+# Test with timeout longer than command duration
+sed -i.bak "s|replication_delay_source_timeout = 2|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for successful execution
+echo "Waiting for successful execution (10s timeout, 5s command)..."
+for i in {1..15}; do
+    if grep -q "executing replication delay command.*delay_cmd_timeout.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed successfully after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should not timeout this time
+if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not have timed out with 10s timeou
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: extended timeout test succeeded
+./shutdownall
+
+echo "All validation tests passed!"
+exit 0
+
-- 
2.51.0



  [application/octet-stream] 0003-doc-document-external-replication-delay-command-and-.patch (4.7K, 5-0003-doc-document-external-replication-delay-command-and-.patch)
  download | inline diff:
From 9b809157967e1498ea690e50cb4a9a39ff4a07a8 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Mon, 8 Sep 2025 09:58:31 +0300
Subject: [PATCH 3/3] doc: document external replication delay command and
 arguments

Add English documentation for replication_delay_source ('builtin'|'cmd'),
replication_delay_source_cmd including positional instance identifier arguments
passed in Pgpool backend order, and replication_delay_source_timeout.
---
 doc/src/sgml/stream-check.sgml | 90 ++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/doc/src/sgml/stream-check.sgml b/doc/src/sgml/stream-check.sgml
index d2ca3ca49..12e745e76 100644
--- a/doc/src/sgml/stream-check.sgml
+++ b/doc/src/sgml/stream-check.sgml
@@ -309,6 +309,96 @@ GRANT pg_monitor TO sr_check_user;
     </listitem>
   </varlistentry>
 
+  <varlistentry id="guc-replication-delay-source" xreflabel="replication_delay_source">
+   <term><varname>replication_delay_source</varname> (<type>enum</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies the source of replication delay information used by the
+     streaming replication delay check worker. Valid values are:
+    </para>
+    <itemizedlist>
+     <listitem>
+      <para><literal>builtin</literal> — query the primary and standbys to compute delay.</para>
+     </listitem>
+     <listitem>
+      <para><literal>cmd</literal> — run an external command to obtain delays for each backend.</para>
+     </listitem>
+    </itemizedlist>
+    <para>
+     When set to <literal>cmd</literal>, <xref linkend="guc-replication-delay-source-cmd"> must be set
+     to the command to execute, and <xref linkend="guc-replication-delay-source-timeout"> controls
+     its timeout.
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry id="guc-replication-delay-source-cmd" xreflabel="replication_delay_source_cmd">
+   <term><varname>replication_delay_source_cmd</varname> (<type>string</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_cmd</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies the external command to execute when <xref linkend="guc-replication-delay-source">
+     is set to <literal>cmd</literal>. The command is executed as the
+     <productname>Pgpool-II</productname> process user.
+    </para>
+    <para>
+     The command receives an ordered list of instance identifiers as positional
+     arguments corresponding to Pgpool backend node indexes (0..N-1). The order
+     matches Pgpool's backend order so the script can map metrics (for example
+     from AWS CloudWatch for Aurora) back to the correct node. Each identifier is
+     one of:
+    </para>
+    <itemizedlist>
+     <listitem>
+      <para><literal>backend_application_name</literal> (if configured)</para>
+     </listitem>
+     <listitem>
+      <para><literal>&lt;hostname&gt;:&lt;port&gt;</literal> (if application name is empty)</para>
+     </listitem>
+     <listitem>
+      <para><literal>node&lt;i&gt;</literal> (fallback)</para>
+     </listitem>
+    </itemizedlist>
+    <para>
+     The command must write a single line to stdout containing one whitespace-separated
+     delay value per backend, in milliseconds, in the same order as the arguments.
+     For example: <literal>"0 25.5 100"</literal> for a 3-node cluster.
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry id="guc-replication-delay-source-timeout" xreflabel="replication_delay_source_timeout">
+   <term><varname>replication_delay_source_timeout</varname> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_timeout</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies the timeout in seconds for the external command executed when
+     <xref linkend="guc-replication-delay-source"> is set to <literal>cmd</literal>.
+     If the command does not finish within the timeout, Pgpool logs an error and
+     continues.
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
   <varlistentry id="guc-log-standby-delay" xreflabel="log_standby_delay">
    <term><varname>log_standby_delay</varname> (<type>enum</type>)
     <indexterm>
-- 
2.51.0



^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-08 12:02  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  1 sibling, 0 replies; 61+ messages in thread

From: Tatsuo Ishii @ 2025-09-08 12:02 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

Wow, that's quick!
I will look into the patch tomorrow.

> Hi Tatsuo,
> 
> Please find attached the 3 patch files (implementation, tests, docs) with
> the updates we discussed.
> 
> What do you think?
> 
> Best,
> 
> On Mon, Sep 8, 2025 at 3:26 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> Hi Nadav,
>>
>> > Hi Tatsuo,
>> >
>> > Thanks for getting back to me. Let me clarify the ordering concern and
>> > provide an example to make it clearer:
>> >
>> > Currently, replication_delay_source_cmd executes without awareness of the
>> > replica list or the order in which Pgpool loads them. For Aurora, since
>> > we’re bypassing the internal DB tables and fetching lag data directly via
>> > the AWS CloudWatch API, we need to ensure the returned lag values are
>> > mapped to the correct instances.
>> >
>> > For example, assume Pgpool has the following configuration:
>> >
>> > primary: db-primary
>> > replicas: db-replica-a, db-replica-b, db-replica-c
>> >
>> > If the command retrieves lag values [15, 120, 60] from CloudWatch, we
>> need
>> > to guarantee these are consistently mapped as:
>> >
>> >
>> >    -
>> >
>> >    db-replica-a → 15ms
>> >    -
>> >
>> >    db-replica-b → 120ms
>> >    -
>> >
>> >    db-replica-c → 60ms
>> >
>> > Without explicitly passing the instance identifiers and their order to
>> the
>> > command, there’s a risk that mismatched ordering will cause Pgpool to
>> make
>> > incorrect routing decisions.
>> >
>> > To address this, I suggest extending replication_delay_source_cmd to
>> accept
>> > an ordered list of instance identifiers as arguments. This way, the
>> command
>> > can fetch the metrics in the same sequence Pgpool expects, ensuring
>> > alignment between configuration and returned data.
>>
>> Thanks for the clarification. Previously I misunderstood that Aurora
>> only provides "reader endpoint", which made me think your proposal to
>> be impossible. But after some research , I found that Aurora also
>> provides "cluster endpoint" which refers to each replica instance.  So
>> let me check if my understanding is
>> correct. replication_delay_source_cmd will be invoked as:
>>
>> replication_delay_source_cmd db-replica-a db-replica-b db-replica-c
>>
>> > Would you agree this approach makes sense?
>>
>> Yes.
>>
>> > If so, I can provide an updated
>> > patch to demonstrate how the command would handle ordered instance
>> mapping.
>>
>> Thanks. That would be good.
>>
>> BTW, There are minor points regarding your previous patch. In the patch
>>
>> 083.external_replication_delay/
>>
>> is the test directory. This does not fit in with our test
>> infrastructure tradition. Tests for new features should be added
>> between 001 and 049. 050 and greater are reserved for tests for bug
>> fixes. So at this point, 041 is appropreate (if other test for a new
>> feature is added before your patch is committed, you need to adjust
>> the number of course).
>>
>> You need to include a patch for documentation. You don't need to write
>> Japanese doc (doc.ja). We will create it from the English document
>> later on.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-09 00:39  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  1 sibling, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-09-09 00:39 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

> Hi Tatsuo,
> 
> Please find attached the 3 patch files (implementation, tests, docs) with
> the updates we discussed.
> 
> What do you think?

I haven't read the code details yet but I have a few questions.

1) Can we use only replication_delay_source_cmd and if it's value is
   'builtin', then we treat it as replication_delay_source = builtin?
   Maybe this is matter of taste but I would like to know your
   opinion.

2) replication_delay_source_cmd will be given an ordered list of
   instance identifiers. But it seems there's no way for the command
   which one is the primary instance. Is it okay for the command?

3) Why do you have 3 kind of instance identifiers (application name,
   hostname (IP) + port and node id? I thought "hostname (IP) + port"
   is sufficient.

Comments?
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp





^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-15 12:48  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-09-15 12:48 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

Sorry for the late reply - I'm traveling with my family at the moment (in
Japan actually) and might be delayed in responding.

Re your points:
1 - we can, but I have to say that a user I tend to prefer configuration
values not have a "magic" value that does something different than the
usual case like this would create. I'd stick with what we already have
planned. happy to hear from others on the mailing list as well of course.

2 - I think we can have the primary always be the first or we can
completely remove it since it might be redundant as it's always going to be
0. what do you think?

3 - I agree with you, next version (after we clear everything else) will
have only ip/hostname+port.

Let me know your thoughts

Thanks!

On Tue, Sep 9, 2025 at 9:42 AM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> > Hi Tatsuo,
> >
> > Please find attached the 3 patch files (implementation, tests, docs) with
> > the updates we discussed.
> >
> > What do you think?
>
> I haven't read the code details yet but I have a few questions.
>
> 1) Can we use only replication_delay_source_cmd and if it's value is
>    'builtin', then we treat it as replication_delay_source = builtin?
>    Maybe this is matter of taste but I would like to know your
>    opinion.
>
> 2) replication_delay_source_cmd will be given an ordered list of
>    instance identifiers. But it seems there's no way for the command
>    which one is the primary instance. Is it okay for the command?
>
> 3) Why do you have 3 kind of instance identifiers (application name,
>    hostname (IP) + port and node id? I thought "hostname (IP) + port"
>    is sufficient.
>
> Comments?
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>


-- 
Nadav Shatz
Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-16 10:30  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-09-16 10:30 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> Hi Tatsuo,
> 
> Sorry for the late reply - I'm traveling with my family at the moment (in
> Japan actually) 

Excellent! Hope you and your family are spending great time in Japan.

> and might be delayed in responding.

No problem at all. I think you should focus on the travel at this
moment.

> Re your points:
> 1 - we can, but I have to say that a user I tend to prefer configuration
> values not have a "magic" value that does something different than the
> usual case like this would create. I'd stick with what we already have
> planned. happy to hear from others on the mailing list as well of course.

Makes sense. I withdraw my proposal.

> 2 - I think we can have the primary always be the first or we can
> completely remove it since it might be redundant as it's always going to be
> 0. what do you think?

What I am not sure is, whether we can assume the command always knows
which host (or IP) is primary? If the answer is yes, then we could
omit the primary. What do you think?

> 3 - I agree with you, next version (after we clear everything else) will
> have only ip/hostname+port.

Thank you for understanding.

> Let me know your thoughts
> 
> Thanks!
> 
> On Tue, Sep 9, 2025 at 9:42 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> Hi Nadav,
>>
>> > Hi Tatsuo,
>> >
>> > Please find attached the 3 patch files (implementation, tests, docs) with
>> > the updates we discussed.
>> >
>> > What do you think?
>>
>> I haven't read the code details yet but I have a few questions.
>>
>> 1) Can we use only replication_delay_source_cmd and if it's value is
>>    'builtin', then we treat it as replication_delay_source = builtin?
>>    Maybe this is matter of taste but I would like to know your
>>    opinion.
>>
>> 2) replication_delay_source_cmd will be given an ordered list of
>>    instance identifiers. But it seems there's no way for the command
>>    which one is the primary instance. Is it okay for the command?
>>
>> 3) Why do you have 3 kind of instance identifiers (application name,
>>    hostname (IP) + port and node id? I thought "hostname (IP) + port"
>>    is sufficient.
>>
>> Comments?
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-20 23:57  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-09-20 23:57 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Thank you for the kind words. We are having a great time!

Regarding the command knowing about the primary I think it is safe to assume. We can start this way and evolve in the future if needed. I can include a note about it in the notes that the command will only receive the secondary instances as arguments.

Anything else that comes to mind?
 
Nadav Shatz
CTO

> On Sep 16, 2025, at 7:30 PM, Tatsuo Ishii <[email protected]> wrote:
> 
> 
>> 
>> Hi Tatsuo,
>> 
>> Sorry for the late reply - I'm traveling with my family at the moment (in
>> Japan actually)
> 
> Excellent! Hope you and your family are spending great time in Japan.
> 
>> and might be delayed in responding.
> 
> No problem at all. I think you should focus on the travel at this
> moment.
> 
>> Re your points:
>> 1 - we can, but I have to say that a user I tend to prefer configuration
>> values not have a "magic" value that does something different than the
>> usual case like this would create. I'd stick with what we already have
>> planned. happy to hear from others on the mailing list as well of course.
> 
> Makes sense. I withdraw my proposal.
> 
>> 2 - I think we can have the primary always be the first or we can
>> completely remove it since it might be redundant as it's always going to be
>> 0. what do you think?
> 
> What I am not sure is, whether we can assume the command always knows
> which host (or IP) is primary? If the answer is yes, then we could
> omit the primary. What do you think?
> 
>> 3 - I agree with you, next version (after we clear everything else) will
>> have only ip/hostname+port.
> 
> Thank you for understanding.
> 
>> Let me know your thoughts
>> 
>> Thanks!
>> 
>>> On Tue, Sep 9, 2025 at 9:42 AM Tatsuo Ishii <[email protected]> wrote:
>>> 
>>> Hi Nadav,
>>> 
>>>> Hi Tatsuo,
>>>> 
>>>> Please find attached the 3 patch files (implementation, tests, docs) with
>>>> the updates we discussed.
>>>> 
>>>> What do you think?
>>> 
>>> I haven't read the code details yet but I have a few questions.
>>> 
>>> 1) Can we use only replication_delay_source_cmd and if it's value is
>>>   'builtin', then we treat it as replication_delay_source = builtin?
>>>   Maybe this is matter of taste but I would like to know your
>>>   opinion.
>>> 
>>> 2) replication_delay_source_cmd will be given an ordered list of
>>>   instance identifiers. But it seems there's no way for the command
>>>   which one is the primary instance. Is it okay for the command?
>>> 
>>> 3) Why do you have 3 kind of instance identifiers (application name,
>>>   hostname (IP) + port and node id? I thought "hostname (IP) + port"
>>>   is sufficient.
>>> 
>>> Comments?
>>> --
>>> Tatsuo Ishii
>>> SRA OSS K.K.
>>> English: http://www.sraoss.co.jp/index_en/
>>> Japanese:http://www.sraoss.co.jp
>>> 
>> 
>> 
>> --
>> Nadav Shatz
>> Tailor Brands | CTO





^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-21 22:34  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-09-21 22:34 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> Thank you for the kind words. We are having a great time!

Glad to hear that!

> Regarding the command knowing about the primary I think it is safe to assume.

Okay.

> We can start this way and evolve in the future if needed.

Agreed.

> I can include a note about it in the notes that the command will only receive the secondary instances as arguments.
> 
> Anything else that comes to mind?

Sounds like a reasonable requirement. Also the command excludes any
instance which is in down state?

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Nadav Shatz
> CTO
> 
>> On Sep 16, 2025, at 7:30 PM, Tatsuo Ishii <[email protected]> wrote:
>> 
>> 
>>> 
>>> Hi Tatsuo,
>>> 
>>> Sorry for the late reply - I'm traveling with my family at the moment (in
>>> Japan actually)
>> 
>> Excellent! Hope you and your family are spending great time in Japan.
>> 
>>> and might be delayed in responding.
>> 
>> No problem at all. I think you should focus on the travel at this
>> moment.
>> 
>>> Re your points:
>>> 1 - we can, but I have to say that a user I tend to prefer configuration
>>> values not have a "magic" value that does something different than the
>>> usual case like this would create. I'd stick with what we already have
>>> planned. happy to hear from others on the mailing list as well of course.
>> 
>> Makes sense. I withdraw my proposal.
>> 
>>> 2 - I think we can have the primary always be the first or we can
>>> completely remove it since it might be redundant as it's always going to be
>>> 0. what do you think?
>> 
>> What I am not sure is, whether we can assume the command always knows
>> which host (or IP) is primary? If the answer is yes, then we could
>> omit the primary. What do you think?
>> 
>>> 3 - I agree with you, next version (after we clear everything else) will
>>> have only ip/hostname+port.
>> 
>> Thank you for understanding.
>> 
>>> Let me know your thoughts
>>> 
>>> Thanks!
>>> 
>>>> On Tue, Sep 9, 2025 at 9:42 AM Tatsuo Ishii <[email protected]> wrote:
>>>> 
>>>> Hi Nadav,
>>>> 
>>>>> Hi Tatsuo,
>>>>> 
>>>>> Please find attached the 3 patch files (implementation, tests, docs) with
>>>>> the updates we discussed.
>>>>> 
>>>>> What do you think?
>>>> 
>>>> I haven't read the code details yet but I have a few questions.
>>>> 
>>>> 1) Can we use only replication_delay_source_cmd and if it's value is
>>>>   'builtin', then we treat it as replication_delay_source = builtin?
>>>>   Maybe this is matter of taste but I would like to know your
>>>>   opinion.
>>>> 
>>>> 2) replication_delay_source_cmd will be given an ordered list of
>>>>   instance identifiers. But it seems there's no way for the command
>>>>   which one is the primary instance. Is it okay for the command?
>>>> 
>>>> 3) Why do you have 3 kind of instance identifiers (application name,
>>>>   hostname (IP) + port and node id? I thought "hostname (IP) + port"
>>>>   is sufficient.
>>>> 
>>>> Comments?
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS K.K.
>>>> English: http://www.sraoss.co.jp/index_en/
>>>> Japanese:http://www.sraoss.co.jp
>>>> 
>>> 
>>> 
>>> --
>>> Nadav Shatz
>>> Tailor Brands | CTO
> 
> 


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-29 12:24  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-09-29 12:24 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

I would actually suggest including down state instances in case pgpool
isn’t aware yet. It can exclude them once it does.
For these cases maybe  -1 ?

Nadav Shatz
Tailor Brands | CTO


On Mon, Sep 22, 2025 at 7:34 AM Tatsuo Ishii <[email protected]> wrote:

> > Thank you for the kind words. We are having a great time!
>
> Glad to hear that!
>
> > Regarding the command knowing about the primary I think it is safe to
> assume.
>
> Okay.
>
> > We can start this way and evolve in the future if needed.
>
> Agreed.
>
> > I can include a note about it in the notes that the command will only
> receive the secondary instances as arguments.
> >
> > Anything else that comes to mind?
>
> Sounds like a reasonable requirement. Also the command excludes any
> instance which is in down state?
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Nadav Shatz
> > CTO
> >
> >> On Sep 16, 2025, at 7:30 PM, Tatsuo Ishii <[email protected]> wrote:
> >>
> >> 
> >>>
> >>> Hi Tatsuo,
> >>>
> >>> Sorry for the late reply - I'm traveling with my family at the moment
> (in
> >>> Japan actually)
> >>
> >> Excellent! Hope you and your family are spending great time in Japan.
> >>
> >>> and might be delayed in responding.
> >>
> >> No problem at all. I think you should focus on the travel at this
> >> moment.
> >>
> >>> Re your points:
> >>> 1 - we can, but I have to say that a user I tend to prefer
> configuration
> >>> values not have a "magic" value that does something different than the
> >>> usual case like this would create. I'd stick with what we already have
> >>> planned. happy to hear from others on the mailing list as well of
> course.
> >>
> >> Makes sense. I withdraw my proposal.
> >>
> >>> 2 - I think we can have the primary always be the first or we can
> >>> completely remove it since it might be redundant as it's always going
> to be
> >>> 0. what do you think?
> >>
> >> What I am not sure is, whether we can assume the command always knows
> >> which host (or IP) is primary? If the answer is yes, then we could
> >> omit the primary. What do you think?
> >>
> >>> 3 - I agree with you, next version (after we clear everything else)
> will
> >>> have only ip/hostname+port.
> >>
> >> Thank you for understanding.
> >>
> >>> Let me know your thoughts
> >>>
> >>> Thanks!
> >>>
> >>>> On Tue, Sep 9, 2025 at 9:42 AM Tatsuo Ishii <[email protected]>
> wrote:
> >>>>
> >>>> Hi Nadav,
> >>>>
> >>>>> Hi Tatsuo,
> >>>>>
> >>>>> Please find attached the 3 patch files (implementation, tests, docs)
> with
> >>>>> the updates we discussed.
> >>>>>
> >>>>> What do you think?
> >>>>
> >>>> I haven't read the code details yet but I have a few questions.
> >>>>
> >>>> 1) Can we use only replication_delay_source_cmd and if it's value is
> >>>>   'builtin', then we treat it as replication_delay_source = builtin?
> >>>>   Maybe this is matter of taste but I would like to know your
> >>>>   opinion.
> >>>>
> >>>> 2) replication_delay_source_cmd will be given an ordered list of
> >>>>   instance identifiers. But it seems there's no way for the command
> >>>>   which one is the primary instance. Is it okay for the command?
> >>>>
> >>>> 3) Why do you have 3 kind of instance identifiers (application name,
> >>>>   hostname (IP) + port and node id? I thought "hostname (IP) + port"
> >>>>   is sufficient.
> >>>>
> >>>> Comments?
> >>>> --
> >>>> Tatsuo Ishii
> >>>> SRA OSS K.K.
> >>>> English: http://www.sraoss.co.jp/index_en/
> >>>> Japanese:http://www.sraoss.co.jp
> >>>>
> >>>
> >>>
> >>> --
> >>> Nadav Shatz
> >>> Tailor Brands | CTO
> >
> >
>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-09-30 09:35  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-09-30 09:35 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> I would actually suggest including down state instances in case pgpool
> isn’t aware yet. It can exclude them once it does.
> For these cases maybe  -1 ?

I don't think pgpool will triger failover even if
replication_delay_source_cmd returns -1 for such instance because
pgpool already has its own method to detect instance down (i.e. health
check) and method to avoid false positive
(i.e. failover_require_consensus).

Still for such instaces replication_delay_source_cmd returns -1 maybe
useful if it's logged for admins.

So I am Okay with the idea.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Nadav Shatz
> Tailor Brands | CTO
> 
> 
> On Mon, Sep 22, 2025 at 7:34 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> > Thank you for the kind words. We are having a great time!
>>
>> Glad to hear that!
>>
>> > Regarding the command knowing about the primary I think it is safe to
>> assume.
>>
>> Okay.
>>
>> > We can start this way and evolve in the future if needed.
>>
>> Agreed.
>>
>> > I can include a note about it in the notes that the command will only
>> receive the secondary instances as arguments.
>> >
>> > Anything else that comes to mind?
>>
>> Sounds like a reasonable requirement. Also the command excludes any
>> instance which is in down state?
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> > Nadav Shatz
>> > CTO
>> >
>> >> On Sep 16, 2025, at 7:30 PM, Tatsuo Ishii <[email protected]> wrote:
>> >>
>> >> 
>> >>>
>> >>> Hi Tatsuo,
>> >>>
>> >>> Sorry for the late reply - I'm traveling with my family at the moment
>> (in
>> >>> Japan actually)
>> >>
>> >> Excellent! Hope you and your family are spending great time in Japan.
>> >>
>> >>> and might be delayed in responding.
>> >>
>> >> No problem at all. I think you should focus on the travel at this
>> >> moment.
>> >>
>> >>> Re your points:
>> >>> 1 - we can, but I have to say that a user I tend to prefer
>> configuration
>> >>> values not have a "magic" value that does something different than the
>> >>> usual case like this would create. I'd stick with what we already have
>> >>> planned. happy to hear from others on the mailing list as well of
>> course.
>> >>
>> >> Makes sense. I withdraw my proposal.
>> >>
>> >>> 2 - I think we can have the primary always be the first or we can
>> >>> completely remove it since it might be redundant as it's always going
>> to be
>> >>> 0. what do you think?
>> >>
>> >> What I am not sure is, whether we can assume the command always knows
>> >> which host (or IP) is primary? If the answer is yes, then we could
>> >> omit the primary. What do you think?
>> >>
>> >>> 3 - I agree with you, next version (after we clear everything else)
>> will
>> >>> have only ip/hostname+port.
>> >>
>> >> Thank you for understanding.
>> >>
>> >>> Let me know your thoughts
>> >>>
>> >>> Thanks!
>> >>>
>> >>>> On Tue, Sep 9, 2025 at 9:42 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >>>>
>> >>>> Hi Nadav,
>> >>>>
>> >>>>> Hi Tatsuo,
>> >>>>>
>> >>>>> Please find attached the 3 patch files (implementation, tests, docs)
>> with
>> >>>>> the updates we discussed.
>> >>>>>
>> >>>>> What do you think?
>> >>>>
>> >>>> I haven't read the code details yet but I have a few questions.
>> >>>>
>> >>>> 1) Can we use only replication_delay_source_cmd and if it's value is
>> >>>>   'builtin', then we treat it as replication_delay_source = builtin?
>> >>>>   Maybe this is matter of taste but I would like to know your
>> >>>>   opinion.
>> >>>>
>> >>>> 2) replication_delay_source_cmd will be given an ordered list of
>> >>>>   instance identifiers. But it seems there's no way for the command
>> >>>>   which one is the primary instance. Is it okay for the command?
>> >>>>
>> >>>> 3) Why do you have 3 kind of instance identifiers (application name,
>> >>>>   hostname (IP) + port and node id? I thought "hostname (IP) + port"
>> >>>>   is sufficient.
>> >>>>
>> >>>> Comments?
>> >>>> --
>> >>>> Tatsuo Ishii
>> >>>> SRA OSS K.K.
>> >>>> English: http://www.sraoss.co.jp/index_en/
>> >>>> Japanese:http://www.sraoss.co.jp
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> Nadav Shatz
>> >>> Tailor Brands | CTO
>> >
>> >
>>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-10-29 10:43  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-10-29 10:43 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi,

I'm back at work - wdyt of this version?

(side note - Japan was incredible :))

On Tue, Sep 30, 2025 at 12:35 PM Tatsuo Ishii <[email protected]> wrote:

> > I would actually suggest including down state instances in case pgpool
> > isn’t aware yet. It can exclude them once it does.
> > For these cases maybe  -1 ?
>
> I don't think pgpool will triger failover even if
> replication_delay_source_cmd returns -1 for such instance because
> pgpool already has its own method to detect instance down (i.e. health
> check) and method to avoid false positive
> (i.e. failover_require_consensus).
>
> Still for such instaces replication_delay_source_cmd returns -1 maybe
> useful if it's logged for admins.
>
> So I am Okay with the idea.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Nadav Shatz
> > Tailor Brands | CTO
> >
> >
> > On Mon, Sep 22, 2025 at 7:34 AM Tatsuo Ishii <[email protected]>
> wrote:
> >
> >> > Thank you for the kind words. We are having a great time!
> >>
> >> Glad to hear that!
> >>
> >> > Regarding the command knowing about the primary I think it is safe to
> >> assume.
> >>
> >> Okay.
> >>
> >> > We can start this way and evolve in the future if needed.
> >>
> >> Agreed.
> >>
> >> > I can include a note about it in the notes that the command will only
> >> receive the secondary instances as arguments.
> >> >
> >> > Anything else that comes to mind?
> >>
> >> Sounds like a reasonable requirement. Also the command excludes any
> >> instance which is in down state?
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
> >> > Nadav Shatz
> >> > CTO
> >> >
> >> >> On Sep 16, 2025, at 7:30 PM, Tatsuo Ishii <[email protected]>
> wrote:
> >> >>
> >> >> 
> >> >>>
> >> >>> Hi Tatsuo,
> >> >>>
> >> >>> Sorry for the late reply - I'm traveling with my family at the
> moment
> >> (in
> >> >>> Japan actually)
> >> >>
> >> >> Excellent! Hope you and your family are spending great time in Japan.
> >> >>
> >> >>> and might be delayed in responding.
> >> >>
> >> >> No problem at all. I think you should focus on the travel at this
> >> >> moment.
> >> >>
> >> >>> Re your points:
> >> >>> 1 - we can, but I have to say that a user I tend to prefer
> >> configuration
> >> >>> values not have a "magic" value that does something different than
> the
> >> >>> usual case like this would create. I'd stick with what we already
> have
> >> >>> planned. happy to hear from others on the mailing list as well of
> >> course.
> >> >>
> >> >> Makes sense. I withdraw my proposal.
> >> >>
> >> >>> 2 - I think we can have the primary always be the first or we can
> >> >>> completely remove it since it might be redundant as it's always
> going
> >> to be
> >> >>> 0. what do you think?
> >> >>
> >> >> What I am not sure is, whether we can assume the command always knows
> >> >> which host (or IP) is primary? If the answer is yes, then we could
> >> >> omit the primary. What do you think?
> >> >>
> >> >>> 3 - I agree with you, next version (after we clear everything else)
> >> will
> >> >>> have only ip/hostname+port.
> >> >>
> >> >> Thank you for understanding.
> >> >>
> >> >>> Let me know your thoughts
> >> >>>
> >> >>> Thanks!
> >> >>>
> >> >>>> On Tue, Sep 9, 2025 at 9:42 AM Tatsuo Ishii <[email protected]>
> >> wrote:
> >> >>>>
> >> >>>> Hi Nadav,
> >> >>>>
> >> >>>>> Hi Tatsuo,
> >> >>>>>
> >> >>>>> Please find attached the 3 patch files (implementation, tests,
> docs)
> >> with
> >> >>>>> the updates we discussed.
> >> >>>>>
> >> >>>>> What do you think?
> >> >>>>
> >> >>>> I haven't read the code details yet but I have a few questions.
> >> >>>>
> >> >>>> 1) Can we use only replication_delay_source_cmd and if it's value
> is
> >> >>>>   'builtin', then we treat it as replication_delay_source =
> builtin?
> >> >>>>   Maybe this is matter of taste but I would like to know your
> >> >>>>   opinion.
> >> >>>>
> >> >>>> 2) replication_delay_source_cmd will be given an ordered list of
> >> >>>>   instance identifiers. But it seems there's no way for the command
> >> >>>>   which one is the primary instance. Is it okay for the command?
> >> >>>>
> >> >>>> 3) Why do you have 3 kind of instance identifiers (application
> name,
> >> >>>>   hostname (IP) + port and node id? I thought "hostname (IP) +
> port"
> >> >>>>   is sufficient.
> >> >>>>
> >> >>>> Comments?
> >> >>>> --
> >> >>>> Tatsuo Ishii
> >> >>>> SRA OSS K.K.
> >> >>>> English: http://www.sraoss.co.jp/index_en/
> >> >>>> Japanese:http://www.sraoss.co.jp
> >> >>>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Nadav Shatz
> >> >>> Tailor Brands | CTO
> >> >
> >> >
> >>
>


-- 
Nadav Shatz
Tailor Brands | CTO


Attachments:

  [application/octet-stream] 0001-external-replication-delay-implementation.patch (14.7K, 3-0001-external-replication-delay-implementation.patch)
  download | inline diff:
From 2565d66b441c82e1e4a70b146aa8bfbda71e1471 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Mon, 27 Oct 2025 16:22:31 +0200
Subject: [PATCH 1/2] feat: external replication delay injection via external
 command

Implementation:
- Add replication_delay_source_cmd configuration string option
- Remove replication_delay_source enum (no magic values)
- Command receives replica node identifiers in host:port format
- Primary node omitted from command arguments and output
- Handle -1 for down nodes (log without triggering failover)
- Command outputs one delay value (ms) per replica
- Falls back to builtin queries if command not configured
- Timeout handling with replication_delay_source_timeout
---
 src/config/pool_config_variables.c            |  21 ++
 src/include/pool_config.h                     |   3 +-
 src/sample/pgpool.conf.sample-stream          |  14 +
 src/streaming_replication/pool_worker_child.c | 340 +++++++++++++++++-
 4 files changed, 376 insertions(+), 2 deletions(-)

diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 5bbe46d3a..efac0d866 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -980,6 +980,16 @@ static struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_cmd", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"External command to retrieve replication delay information.",
+			CONFIG_VAR_TYPE_STRING, false, 0
+		},
+		&g_pool_config.replication_delay_source_cmd,
+		"",
+		NULL, NULL, NULL, NULL
+	},
+
 	{
 		{"failback_command", CFGCXT_RELOAD, FAILOVER_CONFIG,
 			"Command to execute when backend node is attached.",
@@ -2323,6 +2333,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_timeout", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Timeout for external replication delay command execution in seconds.",
+			CONFIG_VAR_TYPE_INT, false, 0
+		},
+		&g_pool_config.replication_delay_source_timeout,
+		10,
+		1, 3600,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	EMPTY_CONFIG_INT
 };
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index be82750e5..67275559f 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -94,7 +94,6 @@ typedef enum LogStandbyDelayModes
 	LSD_NONE
 } LogStandbyDelayModes;
 
-
 typedef enum MemCacheMethod
 {
 	SHMEM_CACHE = 1,
@@ -371,6 +370,8 @@ typedef struct
 	char	   *sr_check_password;	/* password for sr_check_user */
 	char	   *sr_check_database;	/* PostgreSQL database name for streaming
 									 * replication check */
+	char	   *replication_delay_source_cmd;	/* external command for replication delay */
+	int			replication_delay_source_timeout;	/* timeout for external command in seconds */
 	char	   *failover_command;	/* execute command when failover happens */
 	char	   *follow_primary_command; /* execute command when failover is
 										 * ended */
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index a7eb594c9..2ccc907ea 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -519,6 +519,20 @@ backend_clustering_mode = streaming_replication
 
 #sr_check_database = 'postgres'
                                    # Database name for streaming replication check
+
+#replication_delay_source_cmd = ''
+                                   # External command to retrieve replication delay information
+                                   # If set, pgpool uses this command instead of built-in queries
+                                   # Command receives replica node identifiers (host:port) as arguments
+                                   # Primary node is omitted from arguments
+                                   # Command should output one delay value (in ms) per replica
+                                   # Use -1 to indicate a replica that is down but not yet detected
+                                   # Format: "25 100" for 2 replicas (e.g., 3-node cluster with 1 primary)
+                                   # Command runs as the pgpool process user
+#replication_delay_source_timeout = 10
+                                   # Timeout for external command execution in seconds
+                                   # Range: 1-3600 seconds (default: 10)
+
 #delay_threshold = 0
                                    # Threshold before not dispatching query to standby node
                                    # Unit is in bytes
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 4f8f823a3..07559ba45 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -76,6 +76,8 @@ static volatile sig_atomic_t restart_request = 0;
 static void establish_persistent_connection(void);
 static void discard_persistent_connection(void);
 static void check_replication_time_lag(void);
+static void check_replication_time_lag_with_cmd(void);
+static char *build_instance_identifier_for_node(int node_id);
 static void CheckReplicationTimeLagErrorCb(void *arg);
 static unsigned long long int text_to_lsn(char *text);
 static RETSIGTYPE my_signal_handler(int sig);
@@ -259,7 +261,12 @@ do_worker_child(void)
 					POOL_NODE_STATUS *node_status;
 					int			i;
 
-					/* Do replication time lag checking */
+				/* Do replication time lag checking */
+				/* Use external command if replication_delay_source_cmd is configured */
+				if (pool_config->replication_delay_source_cmd &&
+					strlen(pool_config->replication_delay_source_cmd) > 0)
+					check_replication_time_lag_with_cmd();
+				else
 					check_replication_time_lag();
 
 					/* Check node status */
@@ -659,6 +666,337 @@ check_replication_time_lag(void)
 	error_context_stack = callback.previous;
 }
 
+#define MAX_CMD_OUTPUT 4096
+#define MAX_REASONABLE_DELAY_MS 3600000.0  /* 1 hour in milliseconds */
+
+/* Global variable to track command timeout */
+static volatile sig_atomic_t command_timeout_occurred = 0;
+
+/*
+ * Signal handler for command timeou
+ */
+static void
+command_timeout_handler(int sig)
+{
+	command_timeout_occurred = 1;
+}
+
+
+
+/*
+ * Check replication time lag using external command
+ *
+ * The external command receives only replica (standby) node identifiers as arguments,
+ * omitting the primary node. It returns delay values in milliseconds for each replica.
+ * A value of -1 indicates a node that is down but not yet detected by pgpool's health checks.
+ */
+static void
+check_replication_time_lag_with_cmd(void)
+{
+	FILE		   *fp;
+	char		   *command;
+	char		   *line;
+	char		   *token;
+	char		   *saveptr;
+	int				replica_idx;
+	int				num_replicas;
+	double			delay_ms;
+	uint64			delay;
+	int				token_count = 0;
+	BackendInfo	   *bkinfo;
+	ErrorContextCallback callback;
+
+	if (NUM_BACKENDS <= 1)
+	{
+		/* If there's only one node, there's no point to do checking */
+		return;
+	}
+
+	if (REAL_PRIMARY_NODE_ID < 0)
+	{
+		/* No need to check if there's no primary */
+		return;
+	}
+
+	if (!VALID_BACKEND(REAL_PRIMARY_NODE_ID))
+	{
+		/* No need to check replication delay if primary is down */
+		return;
+	}
+
+	if (!pool_config->replication_delay_source_cmd ||
+		strlen(pool_config->replication_delay_source_cmd) == 0)
+	{
+		ereport(WARNING,
+				(errmsg("replication_delay_source_cmd is not configured"),
+				 errhint("Set replication_delay_source_cmd to use external command mode")));
+		/* Fall back to builtin method */
+		check_replication_time_lag();
+		return;
+	}
+
+	/* Allocate buffer for command output */
+	line = palloc(MAX_CMD_OUTPUT);
+	fp = NULL;
+
+	/*
+	 * Register a error context callback to throw proper context message
+	 */
+	callback.callback = CheckReplicationTimeLagErrorCb;
+	callback.arg = NULL;
+	callback.previous = error_context_stack;
+	error_context_stack = &callback;
+
+	/* Execute command as current process user */
+	PG_TRY();
+	{
+		const char *base_command = pool_config->replication_delay_source_cmd;
+		size_t total_len = strlen(base_command) + 1; /* +1 for NUL */
+		
+		/* Build command with replica-only arguments (omit primary) */
+		/* Calculate total command length including space-separated replica identifiers */
+		for (int i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == REAL_PRIMARY_NODE_ID)
+				continue; /* Skip primary node */
+			
+			char *ident = build_instance_identifier_for_node(i);
+			total_len += 1 /* space */ + strlen(ident);
+			pfree(ident);
+		}
+
+		command = palloc(total_len);
+		strlcpy(command, base_command, total_len);
+
+		/* Append replica identifiers */
+		for (int i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == REAL_PRIMARY_NODE_ID)
+				continue; /* Skip primary node */
+			
+			char *ident = build_instance_identifier_for_node(i);
+			strlcat(command, " ", total_len);
+			strlcat(command, ident, total_len);
+			pfree(ident);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("executing replication delay command: %s", command)));
+
+		/* Set up timeout for command execution */
+		command_timeout_occurred = 0;
+		signal(SIGALRM, command_timeout_handler);
+		alarm(pool_config->replication_delay_source_timeout);
+
+		fp = popen(command, "r");
+		if (fp == NULL)
+		{
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+			ereport(ERROR,
+					(errmsg("failed to execute replication delay command: %s", command),
+					 errdetail("popen failed: %m")));
+		}
+
+		if (fgets(line, MAX_CMD_OUTPUT, fp) == NULL)
+		{
+			int pclose_result = pclose(fp);
+			fp = NULL;
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+
+			if (command_timeout_occurred)
+			{
+				ereport(ERROR,
+						(errmsg("replication delay command timed out after %d seconds: %s",
+								pool_config->replication_delay_source_timeout, command),
+						 errhint("Consider increasing replication_delay_source_timeout or optimizing the command")));
+			}
+			else
+			{
+				ereport(ERROR,
+						(errmsg("failed to read output from replication delay command: %s", command),
+						 errdetail("command exit status: %d", pclose_result)));
+			}
+		}
+
+		alarm(0); /* Cancel alarm */
+		signal(SIGALRM, SIG_DFL);
+
+		/* Check if output was truncated */
+		if (strlen(line) == MAX_CMD_OUTPUT - 1 && line[MAX_CMD_OUTPUT - 2] != '\n')
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command output may have been truncated")));
+		}
+
+		pclose(fp);
+		fp = NULL;
+		pfree(command);
+		command = NULL;
+
+		/* Set primary node delay to 0 */
+		bkinfo = pool_get_node_info(REAL_PRIMARY_NODE_ID);
+		bkinfo->standby_delay = 0;
+		bkinfo->standby_delay_by_time = true;
+
+		/* Count expected replicas */
+		num_replicas = NUM_BACKENDS - 1; /* Total nodes minus primary */
+
+		/* Count tokens in output for validation */
+		char *line_copy = pstrdup(line);
+		char *temp_token = strtok(line_copy, " \t\n");
+		while (temp_token != NULL)
+		{
+			token_count++;
+			temp_token = strtok(NULL, " \t\n");
+		}
+		pfree(line_copy);
+
+		if (token_count != num_replicas)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
+							token_count, num_replicas),
+					 errhint("Command should output one delay value per replica node")));
+		}
+
+		/* Parse the output - one delay value per replica in order */
+		token = strtok_r(line, " \t\n", &saveptr);
+		replica_idx = 0;
+
+		for (int i = 0; i < NUM_BACKENDS && token != NULL; i++)
+		{
+			if (i == REAL_PRIMARY_NODE_ID)
+				continue; /* Skip primary - it's not in the output */
+
+			if (!VALID_BACKEND(i))
+			{
+				/* Skip invalid backend but consume token */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				replica_idx++;
+				continue;
+			}
+
+			char *endptr;
+			delay_ms = strtod(token, &endptr);
+
+			/* Validate the conversion */
+			if (*endptr != '\0')
+			{
+				ereport(WARNING,
+						(errmsg("invalid delay value '%s' for node %d, treating as 0",
+								token, i)));
+				delay_ms = 0;
+			}
+
+			bkinfo = pool_get_node_info(i);
+
+			/* Handle -1 for down nodes */
+			if (delay_ms == -1.0)
+			{
+				ereport(LOG,
+						(errmsg("node %d reported as down by external command (delay -1), relying on health check for failover decision",
+								i)));
+				/* Keep previous delay value, don't trigger failover */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				replica_idx++;
+				continue;
+			}
+
+			/* Validate delay value range */
+			if (delay_ms < 0)
+			{
+				ereport(WARNING,
+						(errmsg("negative delay value %.3f for node %d (other than -1), treating as 0",
+								delay_ms, i)));
+				delay_ms = 0;
+			}
+			else if (delay_ms > MAX_REASONABLE_DELAY_MS)
+			{
+				ereport(WARNING,
+						(errmsg("extremely large delay value %.3f for node %d",
+								delay_ms, i)));
+			}
+
+			/* Convert delay from milliseconds to microseconds for internal storage */
+			delay = (uint64)(delay_ms * 1000);
+			bkinfo->standby_delay = delay;
+			bkinfo->standby_delay_by_time = true;
+
+			/* Log delay if necessary */
+			uint64 delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in milliseconds, convert to microseconds */
+
+			if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
+				(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
+				 bkinfo->standby_delay > delay_threshold_by_time))
+			{
+				ereport(LOG,
+						(errmsg("Replication of node: %d is behind %.3f second(s) from the primary server (node: %d) [external command]",
+								i, delay_ms / 1000, REAL_PRIMARY_NODE_ID)));
+			}
+
+			token = strtok_r(NULL, " \t\n", &saveptr);
+			replica_idx++;
+		}
+
+	}
+	PG_CATCH();
+	{
+		/* Cleanup in case of error */
+		alarm(0); /* Cancel any pending alarm */
+		signal(SIGALRM, SIG_DFL);
+		if (fp)
+		{
+			pclose(fp);
+			fp = NULL;
+		}
+		if (line)
+			pfree(line);
+		if (command)
+			pfree(command);
+		error_context_stack = callback.previous;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Normal cleanup */
+	if (line)
+		pfree(line);
+
+	error_context_stack = callback.previous;
+}
+
+/*
+ * build_instance_identifier_for_node
+ *  Build an identifier string for a backend node for passing to external commands.
+ *  Format: "<hostname>:<port>"
+ */
+static char *
+build_instance_identifier_for_node(int node_id)
+{
+	BackendInfo *bi = pool_get_node_info(node_id);
+	size_t hlen;
+	size_t out_len;
+	char *out;
+
+	if (!bi || bi->backend_hostname[0] == '\0' || bi->backend_port <= 0)
+	{
+		/* Fallback if hostname or port is not set */
+		out = palloc(32);
+		snprintf(out, 32, "unknown_node_%d", node_id);
+		return out;
+	}
+
+	/* Use hostname:port format */
+	hlen = strlen(bi->backend_hostname);
+	/* max port chars ~5, plus colon and NUL */
+	out_len = hlen + 1 + 5 + 1;
+	out = palloc(out_len);
+	snprintf(out, out_len, "%s:%d", bi->backend_hostname, bi->backend_port);
+	return out;
+}
+
 static void
 CheckReplicationTimeLagErrorCb(void *arg)
 {
-- 
2.51.1



  [application/octet-stream] 0002-external-replication-delay-tests-and-docs.patch (33.0K, 4-0002-external-replication-delay-tests-and-docs.patch)
  download | inline diff:
From c31ce0ed25610de95aa55c97529495a0905f138d Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Mon, 27 Oct 2025 16:22:40 +0200
Subject: [PATCH 2/2] test+doc: tests and documentation for external
 replication delay

Tests:
- Verify command receives replicas only (primary omitted)
- Verify host:port identifier format
- Test -1 handling for down nodes
- Test integer and float delay values
- Test validation, timeouts, and error handling
- Test wrong output counts and edge cases

Documentation:
- Remove replication_delay_source enum documentation
- Document replication_delay_source_cmd with replica-only semantics
- Document -1 for down nodes
- Provide examples with correct output format
- Update replication_delay_source_timeout docs
---
 doc/src/sgml/stream-check.sgml                |  68 +++
 .../041.external_replication_delay/README     |  59 +++
 .../041.external_replication_delay/test.sh    | 396 ++++++++++++++++++
 .../test_parsing.sh                           |  55 +++
 .../test_validation.sh                        | 318 ++++++++++++++
 5 files changed, 896 insertions(+)
 create mode 100644 src/test/regression/tests/041.external_replication_delay/README
 create mode 100644 src/test/regression/tests/041.external_replication_delay/test.sh
 create mode 100644 src/test/regression/tests/041.external_replication_delay/test_parsing.sh
 create mode 100644 src/test/regression/tests/041.external_replication_delay/test_validation.sh

diff --git a/doc/src/sgml/stream-check.sgml b/doc/src/sgml/stream-check.sgml
index d2ca3ca49..fc4799080 100644
--- a/doc/src/sgml/stream-check.sgml
+++ b/doc/src/sgml/stream-check.sgml
@@ -309,6 +309,74 @@ GRANT pg_monitor TO sr_check_user;
     </listitem>
   </varlistentry>
 
+  <varlistentry id="guc-replication-delay-source-cmd" xreflabel="replication_delay_source_cmd">
+   <term><varname>replication_delay_source_cmd</varname> (<type>string</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_cmd</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies an external command to retrieve replication delay information for replica nodes.
+     When this parameter is set and not empty, <productname>Pgpool-II</productname> uses the
+     external command instead of built-in database queries to obtain replication delays.
+     The command is executed as the <productname>Pgpool-II</productname> process user.
+    </para>
+    <para>
+     The command receives replica node identifiers as positional arguments, with the primary
+     node omitted. Each identifier is in the format <literal>&lt;hostname&gt;:&lt;port&gt;</literal>,
+     for example <literal>server1:5432 server2:5432</literal>. The order matches
+     <productname>Pgpool-II</productname>'s backend order (excluding the primary), allowing the
+     script to correlate external metrics (such as from AWS CloudWatch for Aurora) to the correct nodes.
+    </para>
+    <para>
+     The command must write a single line to stdout containing one whitespace-separated delay value
+     per replica, in milliseconds, in the same order as the arguments. The primary node's delay is
+     implicitly zero and should not be included in the output. Delay values can be integers or
+     floating-point numbers.
+    </para>
+    <para>
+     Special value: <literal>-1</literal> indicates a replica that is down but not yet detected
+     by <productname>Pgpool-II</productname>'s health checks. <productname>Pgpool-II</productname>
+     will log this condition but rely on its own health-check logic to decide whether to trigger
+     failover; no failover is triggered solely by receiving <literal>-1</literal>.
+    </para>
+    <para>
+     Example for a 3-node cluster (1 primary + 2 replicas): if the command receives arguments
+     <literal>server1:5432 server2:5432</literal>, it should output <literal>"25.5 100"</literal>
+     to indicate the first replica has 25.5ms delay and the second has 100ms delay.
+    </para>
+    <para>
+     Default is empty (use built-in replication delay queries).
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry id="guc-replication-delay-source-timeout" xreflabel="replication_delay_source_timeout">
+   <term><varname>replication_delay_source_timeout</varname> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_timeout</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies the timeout in seconds for the external command specified by
+     <xref linkend="guc-replication-delay-source-cmd">.
+     If the command does not finish within the timeout, <productname>Pgpool-II</productname>
+     logs an error and continues using the built-in method.
+    </para>
+    <para>
+     Default is 10 seconds. Valid range is 1-3600 seconds.
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
   <varlistentry id="guc-log-standby-delay" xreflabel="log_standby_delay">
    <term><varname>log_standby_delay</varname> (<type>enum</type>)
     <indexterm>
diff --git a/src/test/regression/tests/041.external_replication_delay/README b/src/test/regression/tests/041.external_replication_delay/README
new file mode 100644
index 000000000..b4df5da40
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/README
@@ -0,0 +1,59 @@
+External Replication Delay Command Test
+========================================
+
+This test verifies the external command replication delay source feature.
+
+Test Coverage:
+- External command receives replica node identifiers only (primary omitted)
+- Instance identifiers in host:port format
+- Basic external command execution with integer and float millisecond values
+- Delay threshold functionality with external commands
+- Command execution as pgpool process user (no su wrapper)
+- Error handling for missing/invalid commands
+- Command execution failure scenarios
+- Command timeout handling with configurable timeout values
+- Input validation for invalid, negative (other than -1), and extremely large delay values
+- Handling of -1 for down nodes (logged but no immediate failover)
+- Wrong number of output values validation
+- Multiple -1 values (multiple down replicas)
+- Mixed scenarios (some replicas up, some down)
+- Output truncation detection
+
+Files:
+- test.sh: Main test script
+- test_parsing.sh: Unit test for parsing logic
+- test_validation.sh: Validation and edge case testing
+- README: This documentation
+
+Key Changes from Original Version:
+- Primary node is omitted from command arguments
+- Command receives only replica identifiers
+- Instance identifiers are in host:port format (not application_name)
+- Output format: one delay per replica (not per all nodes)
+- -1 value indicates down replica without triggering failover
+- Format example: "25 100" for 2 replicas (3-node cluster = 1 primary + 2 replicas)
+
+The test creates temporary command scripts that output delay values in the format:
+"replica1_delay replica2_delay ..."
+
+Where delays are in milliseconds and can be integer or floating-point values.
+Special value -1 indicates a replica that is down but not yet detected by pgpool.
+
+Test Environment:
+- Uses streaming replication mode with 3 nodes
+- Node 0 is primary (omitted from command arguments)
+- Nodes 1 and 2 are replicas (included in command arguments)
+- Configures sr_check_period = 1 second for faster testing
+- Tests various delay scenarios and threshold behaviors
+
+Expected Behavior:
+- External commands receive replica identifiers in host:port format
+- Primary node identifier is never passed to command
+- Command outputs one delay value per replica
+- -1 values are logged but don't trigger immediate failover
+- Delay values are parsed correctly (both int and float)
+- Threshold comparisons work properly
+- Error conditions are handled gracefully
+- Commands timeout appropriately based on configuration
+- Timeout errors provide helpful messages and hints
+- Tests are reliable with proper wait mechanisms instead of fixed sleeps
diff --git a/src/test/regression/tests/041.external_replication_delay/test.sh b/src/test/regression/tests/041.external_replication_delay/test.sh
new file mode 100644
index 000000000..f5675af98
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test.sh
@@ -0,0 +1,396 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command replication delay source
+#
+source $TESTLIBS
+TESTDIR=testdir
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create external command scripts for testing
+# NOTE: Commands now only output delay values for REPLICAS (not primary)
+cat > delay_cmd_static.sh << 'EOF'
+#!/bin/bash
+# Static delay values for replicas: node1=25ms, node2=50ms (node0 is primary, not included)
+echo "25 50"
+EOF
+chmod +x delay_cmd_static.sh
+
+cat > delay_cmd_float.sh << 'EOF'
+#!/bin/bash
+# Float delay values for replicas: node1=25.5ms, node2=100.75ms
+echo "25.5 100.75"
+EOF
+chmod +x delay_cmd_float.sh
+
+cat > delay_cmd_high.sh << 'EOF'
+#!/bin/bash
+# High delay values to test threshold: node1=2000ms, node2=3000ms
+echo "2000 3000"
+EOF
+chmod +x delay_cmd_high.sh
+
+# ----------------------------------------------------------------------------------------
+echo === Test0: External command receives replica identifiers only (primary omitted) ===
+# ----------------------------------------------------------------------------------------
+# Command that captures its arguments and outputs valid delays for 2 replicas
+cat > delay_cmd_args.sh << 'EOF'
+#!/bin/bash
+printf "%s " "$@" > args.txt
+echo "25 50"
+EOF
+chmod +x delay_cmd_args.sh
+
+echo "replication_delay_source_cmd = './delay_cmd_args.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+echo "Waiting for sr_check to pass args..."
+for i in {1..10}; do
+    if [ -f args.txt ]; then
+        break
+    fi
+    sleep 1
+done
+
+if [ ! -f args.txt ]; then
+    echo fail: did not capture command arguments
+    ./shutdownall
+    exit 1
+fi
+
+ARGS_CONTENT=$(cat args.txt | sed 's/[[:space:]]*$//')
+# Should receive 2 replica identifiers in host:port format (server1:11003 server2:11004)
+# Primary (server0:11002) should be omitted
+if ! echo "$ARGS_CONTENT" | grep -q "server1:11003"; then
+    echo "fail: expected server1:11003 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if ! echo "$ARGS_CONTENT" | grep -q "server2:11004"; then
+    echo "fail: expected server2:11004 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if echo "$ARGS_CONTENT" | grep -q "server0:11002"; then
+    echo "fail: primary (server0:11002) should not be in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: argument order correct - replicas only, primary omitted, host:port format
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test1: Basic external command with integer millisecond values ===
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_static.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run and populate delay values
+# sr_check_period is 1 second, so wait a bit longer to ensure it runs
+echo "Waiting for sr_check to run..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that delay values are populated in the log
+grep "executing replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command was not executed
+    echo "Log contents:"
+    tail -20 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+# Verify actual delay values were parsed
+if ! $PSQL -t -c "SHOW POOL_NODES" test | grep -E "[0-9]+\.[0-9]+" >/dev/null; then
+    echo "Warning: No delay values found in POOL_NODES output"
+fi
+
+# Check for delay log messages
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command delay logging not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: basic external command test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test2: External command with floating-point millisecond values ===
+# ----------------------------------------------------------------------------------------
+# Update configuration to use float command
+sed -i.bak "s|delay_cmd_static.sh|delay_cmd_float.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with float values
+echo "Waiting for sr_check with float values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log 2>/dev/null; then
+        echo "Float command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that float values are handled correctly
+grep "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: float command was not executed
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: floating-point values test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test3: External command with delay threshold ===
+# ----------------------------------------------------------------------------------------
+# Update configuration to use high delay command and set threshold
+sed -i.bak "s|delay_cmd_float.sh|delay_cmd_high.sh|" etc/pgpool.conf
+echo "delay_threshold_by_time = 1000" >> etc/pgpool.conf
+echo "backend_weight0 = 0" >> etc/pgpool.conf  # Force queries to standby normally
+echo "backend_weight2 = 0" >> etc/pgpool.conf  # Only use node 1 as standby
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and detect high delays
+echo "Waiting for sr_check with high delay values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_high.sh" log/pgpool.log 2>/dev/null; then
+        echo "High delay command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SELECT * FROM t1 LIMIT 1;
+EOF
+
+# With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
+grep "SELECT \* FROM t1 LIMIT 1.*DB node id: 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: query was not sent to primary node despite high delay
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: delay threshold test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test4: External command execution as process user ===
+# ----------------------------------------------------------------------------------------
+# Test that command runs as the current pgpool process user
+sed -i.bak "s|delay_cmd_high.sh|delay_cmd_static.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for sr_check to run as process user..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed as process user after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check that command was executed (without su wrapper)
+grep "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command was not executed as process user
+    ./shutdownall
+    exit 1
+fi
+
+# Verify no su command was used
+if grep -q "executing replication delay command.*su.*" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not use su wrapper
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: process user execution test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test5: Error handling - missing command ===
+# ----------------------------------------------------------------------------------------
+# Test error handling when command is not configured
+sed -i.bak "s|replication_delay_source_cmd = './delay_cmd_static.sh'|replication_delay_source_cmd = ''|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# With empty command, should fall back to builtin method
+# No specific error message expected - just verify it doesn't crash
+sleep 3
+
+echo ok: empty command test succeeded (fallback to builtin)
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test6: Error handling - command execution failure ===
+# ----------------------------------------------------------------------------------------
+# Test error handling when command fails
+echo "replication_delay_source_cmd = './nonexistent_command.sh'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with failing command
+echo "Waiting for sr_check with failing command..."
+for i in {1..5}; do
+    if grep -q "failed to execute replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command failure detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about command execution failure
+grep "failed to execute replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command execution failure not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command failure test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test7: Command timeout handling ===
+# ----------------------------------------------------------------------------------------
+# Create a command that takes longer than the timeout
+cat > delay_cmd_slow.sh << 'EOF'
+#!/bin/bash
+# Slow command that takes 15 seconds (longer than default 10s timeout)
+sleep 15
+echo "25 50"
+EOF
+chmod +x delay_cmd_slow.sh
+
+# Set a short timeout and use the slow command
+sed -i.bak "s|replication_delay_source_cmd = './nonexistent_command.sh'|replication_delay_source_cmd = './delay_cmd_slow.sh'|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 3" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and timeout
+echo "Waiting for command timeout..."
+for i in {1..15}; do
+    if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+        echo "Command timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout error message
+grep "replication delay command timed out after 3 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test8: Handling of -1 for down nodes ===
+# ----------------------------------------------------------------------------------------
+# Create a command that returns -1 for one replica
+cat > delay_cmd_with_down_node.sh << 'EOF'
+#!/bin/bash
+# Return -1 for first replica (indicating it's down), normal value for second
+echo "-1 50"
+EOF
+chmod +x delay_cmd_with_down_node.sh
+
+# Reset config
+rm -f etc/pgpool.conf.bak
+sed -i.bak "s|delay_cmd_slow.sh|delay_cmd_with_down_node.sh|" etc/pgpool.conf
+sed -i.bak "s|replication_delay_source_timeout = 3|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to process -1 value
+echo "Waiting for sr_check to process -1 value..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command.*delay -1" log/pgpool.log 2>/dev/null; then
+        echo "-1 handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for -1 logging message
+grep "node.*reported as down by external command.*delay -1.*relying on health check" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 handling message not found
+    ./shutdownall
+    exit 1
+fi
+
+# Verify that pgpool didn't crash or trigger failover just from -1
+if grep -q "failover" log/pgpool.log 2>/dev/null; then
+    echo "fail: -1 should not trigger immediate failover"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: -1 handling test succeeded
+./shutdownall
+
+echo "All external replication delay tests passed!"
+exit 0
diff --git a/src/test/regression/tests/041.external_replication_delay/test_parsing.sh b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
new file mode 100644
index 000000000..d024ce559
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+#-------------------------------------------------------------------
+# Unit test for external command parsing logic
+# This tests the parsing without needing a full pgpool setup
+#
+
+echo "=== Testing external command output parsing ==="
+
+# Test 1: Integer values
+echo "Test 1: Integer millisecond values"
+echo "0 25 50" > test_output.txt
+echo "Expected: 0ms, 25ms, 50ms"
+echo "Output: $(cat test_output.txt)"
+echo ""
+
+# Test 2: Float values
+echo "Test 2: Floating-point millisecond values"
+echo "0 25.5 100.75" > test_output_float.txt
+echo "Expected: 0ms, 25.5ms, 100.75ms"
+echo "Output: $(cat test_output_float.txt)"
+echo ""
+
+# Test 3: High precision float values
+echo "Test 3: High precision values"
+echo "0 0.001 999.999" > test_output_precision.txt
+echo "Expected: 0ms, 0.001ms, 999.999ms"
+echo "Output: $(cat test_output_precision.txt)"
+echo ""
+
+# Test 4: Edge case - zero values
+echo "Test 4: All zero values"
+echo "0 0 0" > test_output_zeros.txt
+echo "Expected: 0ms, 0ms, 0ms"
+echo "Output: $(cat test_output_zeros.txt)"
+echo ""
+
+# Test 5: Edge case - large values
+echo "Test 5: Large delay values"
+echo "0 5000 10000" > test_output_large.txt
+echo "Expected: 0ms, 5000ms, 10000ms"
+echo "Output: $(cat test_output_large.txt)"
+echo ""
+
+# Test 6: Mixed integer and float values
+echo "Test 6: Mixed integer and float values"
+echo "0 25 50.5" > test_output_mixed.txt
+echo "Expected: 0ms, 25ms, 50.5ms"
+echo "Output: $(cat test_output_mixed.txt)"
+echo ""
+
+# Cleanup
+rm -f test_output_*.txt
+
+echo "All parsing tests completed. These outputs should be parseable by the external command feature."
+
diff --git a/src/test/regression/tests/041.external_replication_delay/test_validation.sh b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
new file mode 100644
index 000000000..2d96c91a9
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
@@ -0,0 +1,318 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command validation and edge cases
+#
+source $TESTLIBS
+TESTDIR=testdir_validation
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create test command scripts
+# NOTE: All commands output values for REPLICAS only (primary omitted)
+cat > delay_cmd_validation.sh << 'EOF'
+#!/bin/bash
+# Test validation: output with invalid values for 2 replicas
+echo "invalid_value 50.5"
+EOF
+chmod +x delay_cmd_validation.sh
+
+cat > delay_cmd_negative.sh << 'EOF'
+#!/bin/bash
+# Test negative values (other than -1)
+echo "-25 50"
+EOF
+chmod +x delay_cmd_negative.sh
+
+cat > delay_cmd_large.sh << 'EOF'
+#!/bin/bash
+# Test extremely large values
+echo "9999999 50"
+EOF
+chmod +x delay_cmd_large.sh
+
+cat > delay_cmd_wrong_count.sh << 'EOF'
+#!/bin/bash
+# Test wrong number of values (only 1 instead of 2 for 2 replicas)
+echo "25"
+EOF
+chmod +x delay_cmd_wrong_count.sh
+
+# ----------------------------------------------------------------------------------------
+echo === Test1: Validation of invalid delay values ===
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_validation.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run
+echo "Waiting for validation test..."
+for i in {1..10}; do
+    if grep -q "invalid delay value" log/pgpool.log 2>/dev/null; then
+        echo "Validation error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for validation warning
+grep "invalid delay value 'invalid_value' for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: validation warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: invalid value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test2: Negative delay values (other than -1) ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_validation.sh|delay_cmd_negative.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for negative value test..."
+for i in {1..10}; do
+    if grep -q "negative delay value.*other than -1" log/pgpool.log 2>/dev/null; then
+        echo "Negative value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for negative value warning
+grep "negative delay value.*other than -1.*treating as 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: negative value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: negative value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test3: Extremely large delay values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_negative.sh|delay_cmd_large.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for large value test..."
+for i in {1..10}; do
+    if grep -q "extremely large delay value" log/pgpool.log 2>/dev/null; then
+        echo "Large value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for large value warning
+grep "extremely large delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: large value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: large value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test4: Wrong number of output values ===
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_large.sh|delay_cmd_wrong_count.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for wrong count test..."
+for i in {1..10}; do
+    if grep -q "returned.*values, expected.*replica" log/pgpool.log 2>/dev/null; then
+        echo "Wrong count warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for wrong count warning
+grep "returned.*values, expected.*replica.*Command should output one delay value per replica" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: wrong count validation test not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: wrong count validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test5: Multiple -1 values ===
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_multi_down.sh << 'EOF'
+#!/bin/bash
+# Test multiple replicas down
+echo "-1 -1"
+EOF
+chmod +x delay_cmd_multi_down.sh
+
+sed -i.bak "s|delay_cmd_wrong_count.sh|delay_cmd_multi_down.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for multi-down test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Multiple down nodes detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for multiple -1 handling
+DOWN_COUNT=$(grep -c "node.*reported as down by external command.*delay -1" log/pgpool.log)
+if [ "$DOWN_COUNT" -lt 2 ]; then
+    echo fail: expected 2 down node messages, found $DOWN_COUNT
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: multiple -1 handling test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test6: Command timeout with different timeout values ===
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_timeout.sh << 'EOF'
+#!/bin/bash
+# Command that takes 5 seconds
+sleep 5
+echo "25 50"
+EOF
+chmod +x delay_cmd_timeout.sh
+
+# Test with timeout shorter than command duration
+sed -i.bak "s|delay_cmd_multi_down.sh|delay_cmd_timeout.sh|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 2" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for timeout
+echo "Waiting for timeout test (2s timeout, 5s command)..."
+for i in {1..10}; do
+    if grep -q "replication delay command timed out after 2 seconds" log/pgpool.log 2>/dev/null; then
+        echo "Timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout message
+grep "replication delay command timed out after 2 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: timeout test succeeded
+./shutdownall
+
+# Test with timeout longer than command duration
+sed -i.bak "s|replication_delay_source_timeout = 2|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for successful execution
+echo "Waiting for successful execution (10s timeout, 5s command)..."
+for i in {1..15}; do
+    if grep -q "executing replication delay command.*delay_cmd_timeout.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed successfully after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should not timeout this time
+if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not have timed out with 10s timeout
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: extended timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo === Test7: Mix of valid delays and -1 ===
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_mixed.sh << 'EOF'
+#!/bin/bash
+# One replica up (25ms), one down (-1)
+echo "25 -1"
+EOF
+chmod +x delay_cmd_mixed.sh
+
+sed -i.bak "s|delay_cmd_timeout.sh|delay_cmd_mixed.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check
+echo "Waiting for mixed delay test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Mixed delay handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should log one -1 and process one normal delay
+grep "node.*reported as down by external command.*delay -1" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 not logged
+    ./shutdownall
+    exit 1
+fi
+
+# Should also log the normal replica delay
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo "Note: Normal replica delay logging may not be visible with log_standby_delay settings"
+fi
+
+echo ok: mixed delay handling test succeeded
+./shutdownall
+
+echo "All validation tests passed!"
+exit 0
-- 
2.51.1



^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-10-30 23:45  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-10-30 23:45 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> Hi,
> 
> I'm back at work - wdyt of this version?

Thanks for the patch! I will look into it weekend.

> (side note - Japan was incredible :))

Glad to hear that!

> On Tue, Sep 30, 2025 at 12:35 PM Tatsuo Ishii <[email protected]> wrote:
> 
>> > I would actually suggest including down state instances in case pgpool
>> > isn’t aware yet. It can exclude them once it does.
>> > For these cases maybe  -1 ?
>>
>> I don't think pgpool will triger failover even if
>> replication_delay_source_cmd returns -1 for such instance because
>> pgpool already has its own method to detect instance down (i.e. health
>> check) and method to avoid false positive
>> (i.e. failover_require_consensus).
>>
>> Still for such instaces replication_delay_source_cmd returns -1 maybe
>> useful if it's logged for admins.
>>
>> So I am Okay with the idea.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> > Nadav Shatz
>> > Tailor Brands | CTO
>> >
>> >
>> > On Mon, Sep 22, 2025 at 7:34 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> > Thank you for the kind words. We are having a great time!
>> >>
>> >> Glad to hear that!
>> >>
>> >> > Regarding the command knowing about the primary I think it is safe to
>> >> assume.
>> >>
>> >> Okay.
>> >>
>> >> > We can start this way and evolve in the future if needed.
>> >>
>> >> Agreed.
>> >>
>> >> > I can include a note about it in the notes that the command will only
>> >> receive the secondary instances as arguments.
>> >> >
>> >> > Anything else that comes to mind?
>> >>
>> >> Sounds like a reasonable requirement. Also the command excludes any
>> >> instance which is in down state?
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >> > Nadav Shatz
>> >> > CTO
>> >> >
>> >> >> On Sep 16, 2025, at 7:30 PM, Tatsuo Ishii <[email protected]>
>> wrote:
>> >> >>
>> >> >> 
>> >> >>>
>> >> >>> Hi Tatsuo,
>> >> >>>
>> >> >>> Sorry for the late reply - I'm traveling with my family at the
>> moment
>> >> (in
>> >> >>> Japan actually)
>> >> >>
>> >> >> Excellent! Hope you and your family are spending great time in Japan.
>> >> >>
>> >> >>> and might be delayed in responding.
>> >> >>
>> >> >> No problem at all. I think you should focus on the travel at this
>> >> >> moment.
>> >> >>
>> >> >>> Re your points:
>> >> >>> 1 - we can, but I have to say that a user I tend to prefer
>> >> configuration
>> >> >>> values not have a "magic" value that does something different than
>> the
>> >> >>> usual case like this would create. I'd stick with what we already
>> have
>> >> >>> planned. happy to hear from others on the mailing list as well of
>> >> course.
>> >> >>
>> >> >> Makes sense. I withdraw my proposal.
>> >> >>
>> >> >>> 2 - I think we can have the primary always be the first or we can
>> >> >>> completely remove it since it might be redundant as it's always
>> going
>> >> to be
>> >> >>> 0. what do you think?
>> >> >>
>> >> >> What I am not sure is, whether we can assume the command always knows
>> >> >> which host (or IP) is primary? If the answer is yes, then we could
>> >> >> omit the primary. What do you think?
>> >> >>
>> >> >>> 3 - I agree with you, next version (after we clear everything else)
>> >> will
>> >> >>> have only ip/hostname+port.
>> >> >>
>> >> >> Thank you for understanding.
>> >> >>
>> >> >>> Let me know your thoughts
>> >> >>>
>> >> >>> Thanks!
>> >> >>>
>> >> >>>> On Tue, Sep 9, 2025 at 9:42 AM Tatsuo Ishii <[email protected]>
>> >> wrote:
>> >> >>>>
>> >> >>>> Hi Nadav,
>> >> >>>>
>> >> >>>>> Hi Tatsuo,
>> >> >>>>>
>> >> >>>>> Please find attached the 3 patch files (implementation, tests,
>> docs)
>> >> with
>> >> >>>>> the updates we discussed.
>> >> >>>>>
>> >> >>>>> What do you think?
>> >> >>>>
>> >> >>>> I haven't read the code details yet but I have a few questions.
>> >> >>>>
>> >> >>>> 1) Can we use only replication_delay_source_cmd and if it's value
>> is
>> >> >>>>   'builtin', then we treat it as replication_delay_source =
>> builtin?
>> >> >>>>   Maybe this is matter of taste but I would like to know your
>> >> >>>>   opinion.
>> >> >>>>
>> >> >>>> 2) replication_delay_source_cmd will be given an ordered list of
>> >> >>>>   instance identifiers. But it seems there's no way for the command
>> >> >>>>   which one is the primary instance. Is it okay for the command?
>> >> >>>>
>> >> >>>> 3) Why do you have 3 kind of instance identifiers (application
>> name,
>> >> >>>>   hostname (IP) + port and node id? I thought "hostname (IP) +
>> port"
>> >> >>>>   is sufficient.
>> >> >>>>
>> >> >>>> Comments?
>> >> >>>> --
>> >> >>>> Tatsuo Ishii
>> >> >>>> SRA OSS K.K.
>> >> >>>> English: http://www.sraoss.co.jp/index_en/
>> >> >>>> Japanese:http://www.sraoss.co.jp
>> >> >>>>
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> Nadav Shatz
>> >> >>> Tailor Brands | CTO
>> >> >
>> >> >
>> >>
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-11-01 06:36  Tatsuo Ishii <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-11-01 06:36 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

>> Hi,
>> 
>> I'm back at work - wdyt of this version?
> 
> Thanks for the patch! I will look into it weekend.

Here are review comments.

1. git apply complains trailing whitespace and new blank line.

$ git apply ~/0001-external-replication-delay-implementation.patch 
/home/t-ishii/0001-external-replication-delay-implementation.patch:225: trailing whitespace.
		
/home/t-ishii/0001-external-replication-delay-implementation.patch:232: trailing whitespace.
			
/home/t-ishii/0001-external-replication-delay-implementation.patch:246: trailing whitespace.
			
warning: 3 lines add whitespace errors.
$ git apply ~/0002-external-replication-delay-tests-and-docs.patch 
/home/t-ishii/0002-external-replication-delay-tests-and-docs.patch:639: new blank line at EOF.
+
warning: 1 line adds whitespace errors.

2. You can use psprintf() instead of palloc() + snprintf() to make the code simpler.

+	if (!bi || bi->backend_hostname[0] == '\0' || bi->backend_port <= 0)
+	{
+		/* Fallback if hostname or port is not set */
+		out = palloc(32);
+		snprintf(out, 32, "unknown_node_%d", node_id);
+		return out;
+	}

3. Ditto as above.

+	/* Use hostname:port format */
+	hlen = strlen(bi->backend_hostname);
+	/* max port chars ~5, plus colon and NUL */
+	out_len = hlen + 1 + 5 + 1;
+	out = palloc(out_len);
+	snprintf(out, out_len, "%s:%d", bi->backend_hostname, bi->backend_port);
+	return out;

4. There are a few compiler warnings.

streaming_replication/pool_worker_child.c: In function ‘do_worker_child’:
streaming_replication/pool_worker_child.c:269:33: warning: this ‘else’ clause does not guard... [-Wmisleading-indentation]
  269 |                                 else
      |                                 ^~~~
streaming_replication/pool_worker_child.c:273:41: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘else’
  273 |                                         node_status = verify_backend_node_status(slots);
      |                                         ^~~~~~~~~~~

5. 041.external_replication_delay failed.

timeout: failed to run command './test.sh': Permission denied

After running "chmod 755" to fix the issue, still the test fails. From
src/test/regression/log/041.external_replication_delay:

./test.sh: line 45: syntax error near unexpected token `('
./test.sh: line 45: `echo === Test0: External command receives replica identifiers only (primary omitted) ==='

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-11-02 11:23  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-11-02 11:23 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

thanks and sorry for the issues, please find attached updated version.

On Sat, Nov 1, 2025 at 8:36 AM Tatsuo Ishii <[email protected]> wrote:

> >> Hi,
> >>
> >> I'm back at work - wdyt of this version?
> >
> > Thanks for the patch! I will look into it weekend.
>
> Here are review comments.
>
> 1. git apply complains trailing whitespace and new blank line.
>
> $ git apply ~/0001-external-replication-delay-implementation.patch
> /home/t-ishii/0001-external-replication-delay-implementation.patch:225:
> trailing whitespace.
>
> /home/t-ishii/0001-external-replication-delay-implementation.patch:232:
> trailing whitespace.
>
> /home/t-ishii/0001-external-replication-delay-implementation.patch:246:
> trailing whitespace.
>
> warning: 3 lines add whitespace errors.
> $ git apply ~/0002-external-replication-delay-tests-and-docs.patch
> /home/t-ishii/0002-external-replication-delay-tests-and-docs.patch:639:
> new blank line at EOF.
> +
> warning: 1 line adds whitespace errors.
>
> 2. You can use psprintf() instead of palloc() + snprintf() to make the
> code simpler.
>
> +       if (!bi || bi->backend_hostname[0] == '\0' || bi->backend_port <=
> 0)
> +       {
> +               /* Fallback if hostname or port is not set */
> +               out = palloc(32);
> +               snprintf(out, 32, "unknown_node_%d", node_id);
> +               return out;
> +       }
>
> 3. Ditto as above.
>
> +       /* Use hostname:port format */
> +       hlen = strlen(bi->backend_hostname);
> +       /* max port chars ~5, plus colon and NUL */
> +       out_len = hlen + 1 + 5 + 1;
> +       out = palloc(out_len);
> +       snprintf(out, out_len, "%s:%d", bi->backend_hostname,
> bi->backend_port);
> +       return out;
>
> 4. There are a few compiler warnings.
>
> streaming_replication/pool_worker_child.c: In function ‘do_worker_child’:
> streaming_replication/pool_worker_child.c:269:33: warning: this ‘else’
> clause does not guard... [-Wmisleading-indentation]
>   269 |                                 else
>       |                                 ^~~~
> streaming_replication/pool_worker_child.c:273:41: note: ...this statement,
> but the latter is misleadingly indented as if it were guarded by the ‘else’
>   273 |                                         node_status =
> verify_backend_node_status(slots);
>       |                                         ^~~~~~~~~~~
>
> 5. 041.external_replication_delay failed.
>
> timeout: failed to run command './test.sh': Permission denied
>
> After running "chmod 755" to fix the issue, still the test fails. From
> src/test/regression/log/041.external_replication_delay:
>
> ./test.sh: line 45: syntax error near unexpected token `('
> ./test.sh: line 45: `echo === Test0: External command receives replica
> identifiers only (primary omitted) ==='
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>


-- 
Nadav Shatz
Tailor Brands | CTO


Attachments:

  [application/octet-stream] external-replication-delay-full.patch (48.1K, 3-external-replication-delay-full.patch)
  download | inline diff:
From 9ae36ec00e140a2cc898ed92be928d9dc731f75a Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Sun, 2 Nov 2025 13:08:33 +0200
Subject: [PATCH] feat: external replication delay injection via external
 command

Implementation:
- Add replication_delay_source_cmd configuration string option
- Command receives replica node identifiers in host:port format
- Primary node omitted from command arguments and output
- Handle -1 for down nodes (log without triggering failover)
- Command outputs one delay value (ms) per replica
- Falls back to builtin queries if command not configured
- Timeout handling with replication_delay_source_timeout
- Use psprintf() for cleaner code
- Fix indentation and remove trailing whitespace

Tests:
- Verify command receives replicas only (primary omitted)
- Verify host:port identifier format
- Test -1 handling for down nodes
- Test integer and float delay values
- Test validation, timeouts, and error handling
- Test wrong output counts and edge cases

Documentation:
- Document replication_delay_source_cmd with replica-only semantics
- Document -1 for down nodes
- Provide examples with correct output format
- Update replication_delay_source_timeout docs
---
 doc/src/sgml/stream-check.sgml                |  68 +++
 src/config/pool_config_variables.c            |  21 +
 src/include/pool_config.h                     |   3 +-
 src/sample/pgpool.conf.sample-stream          |  14 +
 src/streaming_replication/pool_worker_child.c | 336 ++++++++++++++-
 .../041.external_replication_delay/README     |  59 +++
 .../041.external_replication_delay/test.sh    | 401 ++++++++++++++++++
 .../test_parsing.sh                           |  54 +++
 .../test_validation.sh                        | 323 ++++++++++++++
 9 files changed, 1274 insertions(+), 5 deletions(-)
 create mode 100644 src/test/regression/tests/041.external_replication_delay/README
 create mode 100755 src/test/regression/tests/041.external_replication_delay/test.sh
 create mode 100755 src/test/regression/tests/041.external_replication_delay/test_parsing.sh
 create mode 100755 src/test/regression/tests/041.external_replication_delay/test_validation.sh

diff --git a/doc/src/sgml/stream-check.sgml b/doc/src/sgml/stream-check.sgml
index d2ca3ca49..fc4799080 100644
--- a/doc/src/sgml/stream-check.sgml
+++ b/doc/src/sgml/stream-check.sgml
@@ -309,6 +309,74 @@ GRANT pg_monitor TO sr_check_user;
     </listitem>
   </varlistentry>
 
+  <varlistentry id="guc-replication-delay-source-cmd" xreflabel="replication_delay_source_cmd">
+   <term><varname>replication_delay_source_cmd</varname> (<type>string</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_cmd</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies an external command to retrieve replication delay information for replica nodes.
+     When this parameter is set and not empty, <productname>Pgpool-II</productname> uses the
+     external command instead of built-in database queries to obtain replication delays.
+     The command is executed as the <productname>Pgpool-II</productname> process user.
+    </para>
+    <para>
+     The command receives replica node identifiers as positional arguments, with the primary
+     node omitted. Each identifier is in the format <literal>&lt;hostname&gt;:&lt;port&gt;</literal>,
+     for example <literal>server1:5432 server2:5432</literal>. The order matches
+     <productname>Pgpool-II</productname>'s backend order (excluding the primary), allowing the
+     script to correlate external metrics (such as from AWS CloudWatch for Aurora) to the correct nodes.
+    </para>
+    <para>
+     The command must write a single line to stdout containing one whitespace-separated delay value
+     per replica, in milliseconds, in the same order as the arguments. The primary node's delay is
+     implicitly zero and should not be included in the output. Delay values can be integers or
+     floating-point numbers.
+    </para>
+    <para>
+     Special value: <literal>-1</literal> indicates a replica that is down but not yet detected
+     by <productname>Pgpool-II</productname>'s health checks. <productname>Pgpool-II</productname>
+     will log this condition but rely on its own health-check logic to decide whether to trigger
+     failover; no failover is triggered solely by receiving <literal>-1</literal>.
+    </para>
+    <para>
+     Example for a 3-node cluster (1 primary + 2 replicas): if the command receives arguments
+     <literal>server1:5432 server2:5432</literal>, it should output <literal>"25.5 100"</literal>
+     to indicate the first replica has 25.5ms delay and the second has 100ms delay.
+    </para>
+    <para>
+     Default is empty (use built-in replication delay queries).
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry id="guc-replication-delay-source-timeout" xreflabel="replication_delay_source_timeout">
+   <term><varname>replication_delay_source_timeout</varname> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_timeout</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies the timeout in seconds for the external command specified by
+     <xref linkend="guc-replication-delay-source-cmd">.
+     If the command does not finish within the timeout, <productname>Pgpool-II</productname>
+     logs an error and continues using the built-in method.
+    </para>
+    <para>
+     Default is 10 seconds. Valid range is 1-3600 seconds.
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
   <varlistentry id="guc-log-standby-delay" xreflabel="log_standby_delay">
    <term><varname>log_standby_delay</varname> (<type>enum</type>)
     <indexterm>
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 62a05979a..a35d2200f 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -980,6 +980,16 @@ static struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_cmd", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"External command to retrieve replication delay information.",
+			CONFIG_VAR_TYPE_STRING, false, 0
+		},
+		&g_pool_config.replication_delay_source_cmd,
+		"",
+		NULL, NULL, NULL, NULL
+	},
+
 	{
 		{"failback_command", CFGCXT_RELOAD, FAILOVER_CONFIG,
 			"Command to execute when backend node is attached.",
@@ -2334,6 +2344,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_timeout", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Timeout for external replication delay command execution in seconds.",
+			CONFIG_VAR_TYPE_INT, false, 0
+		},
+		&g_pool_config.replication_delay_source_timeout,
+		10,
+		1, 3600,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	EMPTY_CONFIG_INT
 };
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 9160a31c8..5bc646805 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -86,7 +86,6 @@ typedef enum LogStandbyDelayModes
 	LSD_NONE
 } LogStandbyDelayModes;
 
-
 typedef enum MemCacheMethod
 {
 	SHMEM_CACHE = 1,
@@ -363,6 +362,8 @@ typedef struct
 	char	   *sr_check_password;	/* password for sr_check_user */
 	char	   *sr_check_database;	/* PostgreSQL database name for streaming
 									 * replication check */
+	char	   *replication_delay_source_cmd;	/* external command for replication delay */
+	int			replication_delay_source_timeout;	/* timeout for external command in seconds */
 	char	   *failover_command;	/* execute command when failover happens */
 	char	   *follow_primary_command; /* execute command when failover is
 										 * ended */
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index ba6b923b0..34462fd59 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -519,6 +519,20 @@ backend_clustering_mode = streaming_replication
 
 #sr_check_database = 'postgres'
                                    # Database name for streaming replication check
+
+#replication_delay_source_cmd = ''
+                                   # External command to retrieve replication delay information
+                                   # If set, pgpool uses this command instead of built-in queries
+                                   # Command receives replica node identifiers (host:port) as arguments
+                                   # Primary node is omitted from arguments
+                                   # Command should output one delay value (in ms) per replica
+                                   # Use -1 to indicate a replica that is down but not yet detected
+                                   # Format: "25 100" for 2 replicas (e.g., 3-node cluster with 1 primary)
+                                   # Command runs as the pgpool process user
+#replication_delay_source_timeout = 10
+                                   # Timeout for external command execution in seconds
+                                   # Range: 1-3600 seconds (default: 10)
+
 #delay_threshold = 0
                                    # Threshold before not dispatching query to standby node
                                    # Unit is in bytes
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 5bf19c37d..81dc82922 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -76,6 +76,8 @@ static volatile sig_atomic_t restart_request = 0;
 static void establish_persistent_connection(void);
 static void discard_persistent_connection(void);
 static void check_replication_time_lag(void);
+static void check_replication_time_lag_with_cmd(void);
+static char *build_instance_identifier_for_node(int node_id);
 static void CheckReplicationTimeLagErrorCb(void *arg);
 static unsigned long long int text_to_lsn(char *text);
 static RETSIGTYPE my_signal_handler(int sig);
@@ -259,11 +261,16 @@ do_worker_child(void *params)
 					POOL_NODE_STATUS *node_status;
 					int			i;
 
-					/* Do replication time lag checking */
-					check_replication_time_lag();
+			/* Do replication time lag checking */
+			/* Use external command if replication_delay_source_cmd is configured */
+			if (pool_config->replication_delay_source_cmd &&
+				strlen(pool_config->replication_delay_source_cmd) > 0)
+				check_replication_time_lag_with_cmd();
+			else
+				check_replication_time_lag();
 
-					/* Check node status */
-					node_status = verify_backend_node_status(slots);
+			/* Check node status */
+			node_status = verify_backend_node_status(slots);
 
 
 					for (i = 0; i < NUM_BACKENDS; i++)
@@ -659,6 +666,327 @@ check_replication_time_lag(void)
 	error_context_stack = callback.previous;
 }
 
+#define MAX_CMD_OUTPUT 4096
+#define MAX_REASONABLE_DELAY_MS 3600000.0  /* 1 hour in milliseconds */
+
+/* Global variable to track command timeout */
+static volatile sig_atomic_t command_timeout_occurred = 0;
+
+/*
+ * Signal handler for command timeou
+ */
+static void
+command_timeout_handler(int sig)
+{
+	command_timeout_occurred = 1;
+}
+
+
+
+/*
+ * Check replication time lag using external command
+ *
+ * The external command receives only replica (standby) node identifiers as arguments,
+ * omitting the primary node. It returns delay values in milliseconds for each replica.
+ * A value of -1 indicates a node that is down but not yet detected by pgpool's health checks.
+ */
+static void
+check_replication_time_lag_with_cmd(void)
+{
+	FILE		   *fp;
+	char		   *command;
+	char		   *line;
+	char		   *token;
+	char		   *saveptr;
+	int				replica_idx;
+	int				num_replicas;
+	double			delay_ms;
+	uint64			delay;
+	int				token_count = 0;
+	BackendInfo	   *bkinfo;
+	ErrorContextCallback callback;
+
+	if (NUM_BACKENDS <= 1)
+	{
+		/* If there's only one node, there's no point to do checking */
+		return;
+	}
+
+	if (REAL_PRIMARY_NODE_ID < 0)
+	{
+		/* No need to check if there's no primary */
+		return;
+	}
+
+	if (!VALID_BACKEND(REAL_PRIMARY_NODE_ID))
+	{
+		/* No need to check replication delay if primary is down */
+		return;
+	}
+
+	if (!pool_config->replication_delay_source_cmd ||
+		strlen(pool_config->replication_delay_source_cmd) == 0)
+	{
+		ereport(WARNING,
+				(errmsg("replication_delay_source_cmd is not configured"),
+				 errhint("Set replication_delay_source_cmd to use external command mode")));
+		/* Fall back to builtin method */
+		check_replication_time_lag();
+		return;
+	}
+
+	/* Allocate buffer for command output */
+	line = palloc(MAX_CMD_OUTPUT);
+	fp = NULL;
+
+	/*
+	 * Register a error context callback to throw proper context message
+	 */
+	callback.callback = CheckReplicationTimeLagErrorCb;
+	callback.arg = NULL;
+	callback.previous = error_context_stack;
+	error_context_stack = &callback;
+
+	/* Execute command as current process user */
+	PG_TRY();
+	{
+		const char *base_command = pool_config->replication_delay_source_cmd;
+		size_t total_len = strlen(base_command) + 1; /* +1 for NUL */
+
+		/* Build command with replica-only arguments (omit primary) */
+		/* Calculate total command length including space-separated replica identifiers */
+		for (int i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == REAL_PRIMARY_NODE_ID)
+				continue; /* Skip primary node */
+
+			char *ident = build_instance_identifier_for_node(i);
+			total_len += 1 /* space */ + strlen(ident);
+			pfree(ident);
+		}
+
+		command = palloc(total_len);
+		strlcpy(command, base_command, total_len);
+
+		/* Append replica identifiers */
+		for (int i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == REAL_PRIMARY_NODE_ID)
+				continue; /* Skip primary node */
+
+			char *ident = build_instance_identifier_for_node(i);
+			strlcat(command, " ", total_len);
+			strlcat(command, ident, total_len);
+			pfree(ident);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("executing replication delay command: %s", command)));
+
+		/* Set up timeout for command execution */
+		command_timeout_occurred = 0;
+		signal(SIGALRM, command_timeout_handler);
+		alarm(pool_config->replication_delay_source_timeout);
+
+		fp = popen(command, "r");
+		if (fp == NULL)
+		{
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+			ereport(ERROR,
+					(errmsg("failed to execute replication delay command: %s", command),
+					 errdetail("popen failed: %m")));
+		}
+
+		if (fgets(line, MAX_CMD_OUTPUT, fp) == NULL)
+		{
+			int pclose_result = pclose(fp);
+			fp = NULL;
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+
+			if (command_timeout_occurred)
+			{
+				ereport(ERROR,
+						(errmsg("replication delay command timed out after %d seconds: %s",
+								pool_config->replication_delay_source_timeout, command),
+						 errhint("Consider increasing replication_delay_source_timeout or optimizing the command")));
+			}
+			else
+			{
+				ereport(ERROR,
+						(errmsg("failed to read output from replication delay command: %s", command),
+						 errdetail("command exit status: %d", pclose_result)));
+			}
+		}
+
+		alarm(0); /* Cancel alarm */
+		signal(SIGALRM, SIG_DFL);
+
+		/* Check if output was truncated */
+		if (strlen(line) == MAX_CMD_OUTPUT - 1 && line[MAX_CMD_OUTPUT - 2] != '\n')
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command output may have been truncated")));
+		}
+
+		pclose(fp);
+		fp = NULL;
+		pfree(command);
+		command = NULL;
+
+		/* Set primary node delay to 0 */
+		bkinfo = pool_get_node_info(REAL_PRIMARY_NODE_ID);
+		bkinfo->standby_delay = 0;
+		bkinfo->standby_delay_by_time = true;
+
+		/* Count expected replicas */
+		num_replicas = NUM_BACKENDS - 1; /* Total nodes minus primary */
+
+		/* Count tokens in output for validation */
+		char *line_copy = pstrdup(line);
+		char *temp_token = strtok(line_copy, " \t\n");
+		while (temp_token != NULL)
+		{
+			token_count++;
+			temp_token = strtok(NULL, " \t\n");
+		}
+		pfree(line_copy);
+
+		if (token_count != num_replicas)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
+							token_count, num_replicas),
+					 errhint("Command should output one delay value per replica node")));
+		}
+
+		/* Parse the output - one delay value per replica in order */
+		token = strtok_r(line, " \t\n", &saveptr);
+		replica_idx = 0;
+
+		for (int i = 0; i < NUM_BACKENDS && token != NULL; i++)
+		{
+			if (i == REAL_PRIMARY_NODE_ID)
+				continue; /* Skip primary - it's not in the output */
+
+			if (!VALID_BACKEND(i))
+			{
+				/* Skip invalid backend but consume token */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				replica_idx++;
+				continue;
+			}
+
+			char *endptr;
+			delay_ms = strtod(token, &endptr);
+
+			/* Validate the conversion */
+			if (*endptr != '\0')
+			{
+				ereport(WARNING,
+						(errmsg("invalid delay value '%s' for node %d, treating as 0",
+								token, i)));
+				delay_ms = 0;
+			}
+
+			bkinfo = pool_get_node_info(i);
+
+			/* Handle -1 for down nodes */
+			if (delay_ms == -1.0)
+			{
+				ereport(LOG,
+						(errmsg("node %d reported as down by external command (delay -1), relying on health check for failover decision",
+								i)));
+				/* Keep previous delay value, don't trigger failover */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				replica_idx++;
+				continue;
+			}
+
+			/* Validate delay value range */
+			if (delay_ms < 0)
+			{
+				ereport(WARNING,
+						(errmsg("negative delay value %.3f for node %d (other than -1), treating as 0",
+								delay_ms, i)));
+				delay_ms = 0;
+			}
+			else if (delay_ms > MAX_REASONABLE_DELAY_MS)
+			{
+				ereport(WARNING,
+						(errmsg("extremely large delay value %.3f for node %d",
+								delay_ms, i)));
+			}
+
+			/* Convert delay from milliseconds to microseconds for internal storage */
+			delay = (uint64)(delay_ms * 1000);
+			bkinfo->standby_delay = delay;
+			bkinfo->standby_delay_by_time = true;
+
+			/* Log delay if necessary */
+			uint64 delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in milliseconds, convert to microseconds */
+
+			if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
+				(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
+				 bkinfo->standby_delay > delay_threshold_by_time))
+			{
+				ereport(LOG,
+						(errmsg("Replication of node: %d is behind %.3f second(s) from the primary server (node: %d) [external command]",
+								i, delay_ms / 1000, REAL_PRIMARY_NODE_ID)));
+			}
+
+			token = strtok_r(NULL, " \t\n", &saveptr);
+			replica_idx++;
+		}
+
+	}
+	PG_CATCH();
+	{
+		/* Cleanup in case of error */
+		alarm(0); /* Cancel any pending alarm */
+		signal(SIGALRM, SIG_DFL);
+		if (fp)
+		{
+			pclose(fp);
+			fp = NULL;
+		}
+		if (line)
+			pfree(line);
+		if (command)
+			pfree(command);
+		error_context_stack = callback.previous;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Normal cleanup */
+	if (line)
+		pfree(line);
+
+	error_context_stack = callback.previous;
+}
+
+/*
+ * build_instance_identifier_for_node
+ *  Build an identifier string for a backend node for passing to external commands.
+ *  Format: "<hostname>:<port>"
+ */
+static char *
+build_instance_identifier_for_node(int node_id)
+{
+	BackendInfo *bi = pool_get_node_info(node_id);
+
+	if (!bi || bi->backend_hostname[0] == '\0' || bi->backend_port <= 0)
+	{
+		/* Fallback if hostname or port is not set */
+		return psprintf("unknown_node_%d", node_id);
+	}
+
+	/* Use hostname:port format */
+	return psprintf("%s:%d", bi->backend_hostname, bi->backend_port);
+}
+
 static void
 CheckReplicationTimeLagErrorCb(void *arg)
 {
diff --git a/src/test/regression/tests/041.external_replication_delay/README b/src/test/regression/tests/041.external_replication_delay/README
new file mode 100644
index 000000000..b4df5da40
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/README
@@ -0,0 +1,59 @@
+External Replication Delay Command Test
+========================================
+
+This test verifies the external command replication delay source feature.
+
+Test Coverage:
+- External command receives replica node identifiers only (primary omitted)
+- Instance identifiers in host:port format
+- Basic external command execution with integer and float millisecond values
+- Delay threshold functionality with external commands
+- Command execution as pgpool process user (no su wrapper)
+- Error handling for missing/invalid commands
+- Command execution failure scenarios
+- Command timeout handling with configurable timeout values
+- Input validation for invalid, negative (other than -1), and extremely large delay values
+- Handling of -1 for down nodes (logged but no immediate failover)
+- Wrong number of output values validation
+- Multiple -1 values (multiple down replicas)
+- Mixed scenarios (some replicas up, some down)
+- Output truncation detection
+
+Files:
+- test.sh: Main test script
+- test_parsing.sh: Unit test for parsing logic
+- test_validation.sh: Validation and edge case testing
+- README: This documentation
+
+Key Changes from Original Version:
+- Primary node is omitted from command arguments
+- Command receives only replica identifiers
+- Instance identifiers are in host:port format (not application_name)
+- Output format: one delay per replica (not per all nodes)
+- -1 value indicates down replica without triggering failover
+- Format example: "25 100" for 2 replicas (3-node cluster = 1 primary + 2 replicas)
+
+The test creates temporary command scripts that output delay values in the format:
+"replica1_delay replica2_delay ..."
+
+Where delays are in milliseconds and can be integer or floating-point values.
+Special value -1 indicates a replica that is down but not yet detected by pgpool.
+
+Test Environment:
+- Uses streaming replication mode with 3 nodes
+- Node 0 is primary (omitted from command arguments)
+- Nodes 1 and 2 are replicas (included in command arguments)
+- Configures sr_check_period = 1 second for faster testing
+- Tests various delay scenarios and threshold behaviors
+
+Expected Behavior:
+- External commands receive replica identifiers in host:port format
+- Primary node identifier is never passed to command
+- Command outputs one delay value per replica
+- -1 values are logged but don't trigger immediate failover
+- Delay values are parsed correctly (both int and float)
+- Threshold comparisons work properly
+- Error conditions are handled gracefully
+- Commands timeout appropriately based on configuration
+- Timeout errors provide helpful messages and hints
+- Tests are reliable with proper wait mechanisms instead of fixed sleeps
diff --git a/src/test/regression/tests/041.external_replication_delay/test.sh b/src/test/regression/tests/041.external_replication_delay/test.sh
new file mode 100755
index 000000000..f02a086b1
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test.sh
@@ -0,0 +1,401 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command replication delay source
+#
+source $TESTLIBS
+TESTDIR=testdir
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create external command scripts for testing
+# NOTE: Commands now only output delay values for REPLICAS (not primary)
+cat > delay_cmd_static.sh << 'EOF'
+#!/bin/bash
+# Static delay values for replicas: node1=25ms, node2=50ms (node0 is primary, not included)
+echo "25 50"
+EOF
+chmod +x delay_cmd_static.sh
+
+cat > delay_cmd_float.sh << 'EOF'
+#!/bin/bash
+# Float delay values for replicas: node1=25.5ms, node2=100.75ms
+echo "25.5 100.75"
+EOF
+chmod +x delay_cmd_float.sh
+
+cat > delay_cmd_high.sh << 'EOF'
+#!/bin/bash
+# High delay values to test threshold: node1=2000ms, node2=3000ms
+echo "2000 3000"
+EOF
+chmod +x delay_cmd_high.sh
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test0: External command receives replica identifiers only (primary omitted) ==="
+# ----------------------------------------------------------------------------------------
+# Command that captures its arguments and outputs valid delays for 2 replicas
+cat > delay_cmd_args.sh << 'EOF'
+#!/bin/bash
+printf "%s " "$@" > args.txt
+echo "25 50"
+EOF
+chmod +x delay_cmd_args.sh
+
+echo "replication_delay_source_cmd = './delay_cmd_args.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+# Reduce memory requirements for macOS shared memory limits
+echo "num_init_children = 4" >> etc/pgpool.conf
+echo "max_pool = 2" >> etc/pgpool.conf
+# Disable query caching to avoid shared memory issues on macOS
+echo "memory_cache_enabled = off" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+echo "Waiting for sr_check to pass args..."
+for i in {1..10}; do
+    if [ -f args.txt ]; then
+        break
+    fi
+    sleep 1
+done
+
+if [ ! -f args.txt ]; then
+    echo fail: did not capture command arguments
+    ./shutdownall
+    exit 1
+fi
+
+ARGS_CONTENT=$(cat args.txt | sed 's/[[:space:]]*$//')
+# Should receive 2 replica identifiers in host:port format (localhost:11003 localhost:11004 or server1:11003 server2:11004)
+# Primary (localhost:11002 or server0:11002) should be omitted
+if ! echo "$ARGS_CONTENT" | grep -qE "(server1|localhost):11003"; then
+    echo "fail: expected replica1:11003 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if ! echo "$ARGS_CONTENT" | grep -qE "(server2|localhost):11004"; then
+    echo "fail: expected replica2:11004 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if echo "$ARGS_CONTENT" | grep -qE "(server0|localhost):11002"; then
+    echo "fail: primary should not be in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: argument order correct - replicas only, primary omitted, host:port format
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test1: Basic external command with integer millisecond values ==="
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_static.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run and populate delay values
+# sr_check_period is 1 second, so wait a bit longer to ensure it runs
+echo "Waiting for sr_check to run..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that delay values are populated in the log
+grep "executing replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command was not executed
+    echo "Log contents:"
+    tail -20 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+# Verify actual delay values were parsed
+if ! $PSQL -t -c "SHOW POOL_NODES" test | grep -E "[0-9]+\.[0-9]+" >/dev/null; then
+    echo "Warning: No delay values found in POOL_NODES output"
+fi
+
+# Check for delay log messages
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command delay logging not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: basic external command test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test2: External command with floating-point millisecond values ==="
+# ----------------------------------------------------------------------------------------
+# Update configuration to use float command
+sed -i.bak "s|delay_cmd_static.sh|delay_cmd_float.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with float values
+echo "Waiting for sr_check with float values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log 2>/dev/null; then
+        echo "Float command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that float values are handled correctly
+grep "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: float command was not executed
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: floating-point values test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test3: External command with delay threshold ==="
+# ----------------------------------------------------------------------------------------
+# Update configuration to use high delay command and set threshold
+sed -i.bak "s|delay_cmd_float.sh|delay_cmd_high.sh|" etc/pgpool.conf
+echo "delay_threshold_by_time = 1000" >> etc/pgpool.conf
+echo "backend_weight0 = 0" >> etc/pgpool.conf  # Force queries to standby normally
+echo "backend_weight2 = 0" >> etc/pgpool.conf  # Only use node 1 as standby
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and detect high delays
+echo "Waiting for sr_check with high delay values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_high.sh" log/pgpool.log 2>/dev/null; then
+        echo "High delay command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SELECT * FROM t1 LIMIT 1;
+EOF
+
+# With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
+grep "SELECT \* FROM t1 LIMIT 1.*DB node id: 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: query was not sent to primary node despite high delay
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: delay threshold test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test4: External command execution as process user ==="
+# ----------------------------------------------------------------------------------------
+# Test that command runs as the current pgpool process user
+sed -i.bak "s|delay_cmd_high.sh|delay_cmd_static.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for sr_check to run as process user..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed as process user after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check that command was executed (without su wrapper)
+grep "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command was not executed as process user
+    ./shutdownall
+    exit 1
+fi
+
+# Verify no su command was used
+if grep -q "executing replication delay command.*su.*" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not use su wrapper
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: process user execution test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test5: Error handling - missing command ==="
+# ----------------------------------------------------------------------------------------
+# Test error handling when command is not configured
+sed -i.bak "s|replication_delay_source_cmd = './delay_cmd_static.sh'|replication_delay_source_cmd = ''|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# With empty command, should fall back to builtin method
+# No specific error message expected - just verify it doesn't crash
+sleep 3
+
+echo ok: empty command test succeeded (fallback to builtin)
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test6: Error handling - command execution failure ==="
+# ----------------------------------------------------------------------------------------
+# Test error handling when command fails
+echo "replication_delay_source_cmd = './nonexistent_command.sh'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with failing command
+echo "Waiting for sr_check with failing command..."
+for i in {1..5}; do
+    if grep -q "failed to execute replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command failure detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about command execution failure
+grep "failed to execute replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command execution failure not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command failure test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test7: Command timeout handling ==="
+# ----------------------------------------------------------------------------------------
+# Create a command that takes longer than the timeout
+cat > delay_cmd_slow.sh << 'EOF'
+#!/bin/bash
+# Slow command that takes 15 seconds (longer than default 10s timeout)
+sleep 15
+echo "25 50"
+EOF
+chmod +x delay_cmd_slow.sh
+
+# Set a short timeout and use the slow command
+sed -i.bak "s|replication_delay_source_cmd = './nonexistent_command.sh'|replication_delay_source_cmd = './delay_cmd_slow.sh'|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 3" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and timeout
+echo "Waiting for command timeout..."
+for i in {1..15}; do
+    if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+        echo "Command timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout error message
+grep "replication delay command timed out after 3 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test8: Handling of -1 for down nodes ==="
+# ----------------------------------------------------------------------------------------
+# Create a command that returns -1 for one replica
+cat > delay_cmd_with_down_node.sh << 'EOF'
+#!/bin/bash
+# Return -1 for first replica (indicating it's down), normal value for second
+echo "-1 50"
+EOF
+chmod +x delay_cmd_with_down_node.sh
+
+# Reset config
+rm -f etc/pgpool.conf.bak
+sed -i.bak "s|delay_cmd_slow.sh|delay_cmd_with_down_node.sh|" etc/pgpool.conf
+sed -i.bak "s|replication_delay_source_timeout = 3|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to process -1 value
+echo "Waiting for sr_check to process -1 value..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command.*delay -1" log/pgpool.log 2>/dev/null; then
+        echo "-1 handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for -1 logging message
+grep "node.*reported as down by external command.*delay -1.*relying on health check" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 handling message not found
+    ./shutdownall
+    exit 1
+fi
+
+# Verify that pgpool didn't crash or trigger failover just from -1
+if grep -q "failover" log/pgpool.log 2>/dev/null; then
+    echo "fail: -1 should not trigger immediate failover"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: -1 handling test succeeded
+./shutdownall
+
+echo "All external replication delay tests passed!"
+exit 0
diff --git a/src/test/regression/tests/041.external_replication_delay/test_parsing.sh b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
new file mode 100755
index 000000000..82fdad144
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+#-------------------------------------------------------------------
+# Unit test for external command parsing logic
+# This tests the parsing without needing a full pgpool setup
+#
+
+echo "=== Testing external command output parsing ==="
+
+# Test 1: Integer values
+echo "Test 1: Integer millisecond values"
+echo "0 25 50" > test_output.txt
+echo "Expected: 0ms, 25ms, 50ms"
+echo "Output: $(cat test_output.txt)"
+echo ""
+
+# Test 2: Float values
+echo "Test 2: Floating-point millisecond values"
+echo "0 25.5 100.75" > test_output_float.txt
+echo "Expected: 0ms, 25.5ms, 100.75ms"
+echo "Output: $(cat test_output_float.txt)"
+echo ""
+
+# Test 3: High precision float values
+echo "Test 3: High precision values"
+echo "0 0.001 999.999" > test_output_precision.txt
+echo "Expected: 0ms, 0.001ms, 999.999ms"
+echo "Output: $(cat test_output_precision.txt)"
+echo ""
+
+# Test 4: Edge case - zero values
+echo "Test 4: All zero values"
+echo "0 0 0" > test_output_zeros.txt
+echo "Expected: 0ms, 0ms, 0ms"
+echo "Output: $(cat test_output_zeros.txt)"
+echo ""
+
+# Test 5: Edge case - large values
+echo "Test 5: Large delay values"
+echo "0 5000 10000" > test_output_large.txt
+echo "Expected: 0ms, 5000ms, 10000ms"
+echo "Output: $(cat test_output_large.txt)"
+echo ""
+
+# Test 6: Mixed integer and float values
+echo "Test 6: Mixed integer and float values"
+echo "0 25 50.5" > test_output_mixed.txt
+echo "Expected: 0ms, 25ms, 50.5ms"
+echo "Output: $(cat test_output_mixed.txt)"
+echo ""
+
+# Cleanup
+rm -f test_output_*.txt
+
+echo "All parsing tests completed. These outputs should be parseable by the external command feature."
diff --git a/src/test/regression/tests/041.external_replication_delay/test_validation.sh b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
new file mode 100755
index 000000000..2cd4a7f0b
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
@@ -0,0 +1,323 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command validation and edge cases
+#
+source $TESTLIBS
+TESTDIR=testdir_validation
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create test command scripts
+# NOTE: All commands output values for REPLICAS only (primary omitted)
+cat > delay_cmd_validation.sh << 'EOF'
+#!/bin/bash
+# Test validation: output with invalid values for 2 replicas
+echo "invalid_value 50.5"
+EOF
+chmod +x delay_cmd_validation.sh
+
+cat > delay_cmd_negative.sh << 'EOF'
+#!/bin/bash
+# Test negative values (other than -1)
+echo "-25 50"
+EOF
+chmod +x delay_cmd_negative.sh
+
+cat > delay_cmd_large.sh << 'EOF'
+#!/bin/bash
+# Test extremely large values
+echo "9999999 50"
+EOF
+chmod +x delay_cmd_large.sh
+
+cat > delay_cmd_wrong_count.sh << 'EOF'
+#!/bin/bash
+# Test wrong number of values (only 1 instead of 2 for 2 replicas)
+echo "25"
+EOF
+chmod +x delay_cmd_wrong_count.sh
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test1: Validation of invalid delay values ==="
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_validation.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+# Reduce memory requirements for macOS shared memory limits
+echo "num_init_children = 4" >> etc/pgpool.conf
+echo "max_pool = 2" >> etc/pgpool.conf
+# Disable query caching to avoid shared memory issues on macOS
+echo "memory_cache_enabled = off" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run
+echo "Waiting for validation test..."
+for i in {1..10}; do
+    if grep -q "invalid delay value" log/pgpool.log 2>/dev/null; then
+        echo "Validation error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for validation warning
+grep "invalid delay value 'invalid_value' for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: validation warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: invalid value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test2: Negative delay values (other than -1) ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_validation.sh|delay_cmd_negative.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for negative value test..."
+for i in {1..10}; do
+    if grep -q "negative delay value.*other than -1" log/pgpool.log 2>/dev/null; then
+        echo "Negative value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for negative value warning
+grep "negative delay value.*other than -1.*treating as 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: negative value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: negative value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test3: Extremely large delay values ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_negative.sh|delay_cmd_large.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for large value test..."
+for i in {1..10}; do
+    if grep -q "extremely large delay value" log/pgpool.log 2>/dev/null; then
+        echo "Large value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for large value warning
+grep "extremely large delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: large value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: large value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test4: Wrong number of output values ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_large.sh|delay_cmd_wrong_count.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for wrong count test..."
+for i in {1..10}; do
+    if grep -q "returned.*values, expected.*replica" log/pgpool.log 2>/dev/null; then
+        echo "Wrong count warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for wrong count warning
+grep "returned.*values, expected.*replica.*Command should output one delay value per replica" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: wrong count validation test not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: wrong count validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test5: Multiple -1 values ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_multi_down.sh << 'EOF'
+#!/bin/bash
+# Test multiple replicas down
+echo "-1 -1"
+EOF
+chmod +x delay_cmd_multi_down.sh
+
+sed -i.bak "s|delay_cmd_wrong_count.sh|delay_cmd_multi_down.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for multi-down test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Multiple down nodes detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for multiple -1 handling
+DOWN_COUNT=$(grep -c "node.*reported as down by external command.*delay -1" log/pgpool.log)
+if [ "$DOWN_COUNT" -lt 2 ]; then
+    echo fail: expected 2 down node messages, found $DOWN_COUNT
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: multiple -1 handling test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test6: Command timeout with different timeout values ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_timeout.sh << 'EOF'
+#!/bin/bash
+# Command that takes 5 seconds
+sleep 5
+echo "25 50"
+EOF
+chmod +x delay_cmd_timeout.sh
+
+# Test with timeout shorter than command duration
+sed -i.bak "s|delay_cmd_multi_down.sh|delay_cmd_timeout.sh|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 2" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for timeout
+echo "Waiting for timeout test (2s timeout, 5s command)..."
+for i in {1..10}; do
+    if grep -q "replication delay command timed out after 2 seconds" log/pgpool.log 2>/dev/null; then
+        echo "Timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout message
+grep "replication delay command timed out after 2 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: timeout test succeeded
+./shutdownall
+
+# Test with timeout longer than command duration
+sed -i.bak "s|replication_delay_source_timeout = 2|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for successful execution
+echo "Waiting for successful execution (10s timeout, 5s command)..."
+for i in {1..15}; do
+    if grep -q "executing replication delay command.*delay_cmd_timeout.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed successfully after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should not timeout this time
+if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not have timed out with 10s timeout
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: extended timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test7: Mix of valid delays and -1 ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_mixed.sh << 'EOF'
+#!/bin/bash
+# One replica up (25ms), one down (-1)
+echo "25 -1"
+EOF
+chmod +x delay_cmd_mixed.sh
+
+sed -i.bak "s|delay_cmd_timeout.sh|delay_cmd_mixed.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check
+echo "Waiting for mixed delay test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Mixed delay handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should log one -1 and process one normal delay
+grep "node.*reported as down by external command.*delay -1" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 not logged
+    ./shutdownall
+    exit 1
+fi
+
+# Should also log the normal replica delay
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo "Note: Normal replica delay logging may not be visible with log_standby_delay settings"
+fi
+
+echo ok: mixed delay handling test succeeded
+./shutdownall
+
+echo "All validation tests passed!"
+exit 0
\ No newline at end of file
-- 
2.51.1



^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-11-03 07:05  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-11-03 07:05 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> thanks and sorry for the issues, please find attached updated version.

No problem.

This time the patch applies fine, no compiler warnings.  However,
regression test did not passed here (on Ubuntu 24 LTS if this
matters).  So I looked into
src/test/regression/tests/041.external_replication_delay/test.sh a
little bit and apply attached patch (test.sh.patch). It moved forward
partially but failed at:

fail: command execution failure not detected

Please find attached
src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
and src/test/regression/log/041.external_replication_delay.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


Attachments:

  [application/octet-stream] test.sh.patch (1.4K, 2-test.sh.patch)
  download | inline diff:
*** src/test/regression/tests/041.external_replication_delay/test.sh	2025-11-03 15:27:29.630972600 +0900
--- /tmp/test.sh	2025-11-03 15:22:31.503487058 +0900
***************
*** 217,223 ****
  EOF
  
  # With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
! grep "SELECT \* FROM t1 LIMIT 1.*DB node id: 0" log/pgpool.log >/dev/null 2>&1
  if [ $? != 0 ];then
      echo fail: query was not sent to primary node despite high delay
      ./shutdownall
--- 217,223 ----
  EOF
  
  # With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
! grep "DB node id: 0 backend pid: [0-9]* statement: SELECT \* FROM t1 LIMIT 1;" log/pgpool.log >/dev/null 2>&1
  if [ $? != 0 ];then
      echo fail: query was not sent to primary node despite high delay
      ./shutdownall
***************
*** 277,283 ****
  # No specific error message expected - just verify it doesn't crash
  sleep 3
  
! echo ok: empty command test succeeded (fallback to builtin)
  ./shutdownall
  
  # ----------------------------------------------------------------------------------------
--- 277,283 ----
  # No specific error message expected - just verify it doesn't crash
  sleep 3
  
! echo "ok: empty command test succeeded (fallback to builtin)"
  ./shutdownall
  
  # ----------------------------------------------------------------------------------------


  [application/octet-stream] pgpool.log (36.8K, 3-pgpool.log)
  download

  [application/octet-stream] 041.external_replication_delay (18.5K, 4-041.external_replication_delay)
  download

^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-11-05 10:37  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-11-05 10:37 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Sorry for that - thanks for the patch.

Please find attached a new version

On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <[email protected]> wrote:

> > thanks and sorry for the issues, please find attached updated version.
>
> No problem.
>
> This time the patch applies fine, no compiler warnings.  However,
> regression test did not passed here (on Ubuntu 24 LTS if this
> matters).  So I looked into
> src/test/regression/tests/041.external_replication_delay/test.sh a
> little bit and apply attached patch (test.sh.patch). It moved forward
> partially but failed at:
>
> fail: command execution failure not detected
>
> Please find attached
> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
> and src/test/regression/log/041.external_replication_delay.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>


-- 
Nadav Shatz
Tailor Brands | CTO


Attachments:

  [application/octet-stream] external-replication-delay.mbox (48.4K, 3-external-replication-delay.mbox)
  download | inline diff:
From b26e1e1a65a815b2d8a8cdd79c039ed5a283e341 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Sun, 2 Nov 2025 13:08:33 +0200
Subject: [PATCH] feat: external replication delay injection via external
 command

Implementation:
- Add replication_delay_source_cmd configuration string option
- Command receives replica node identifiers in host:port format
- Primary node omitted from command arguments and output
- Handle -1 for down nodes (log without triggering failover)
- Command outputs one delay value (ms) per replica
- Falls back to builtin queries if command not configured
- Timeout handling with replication_delay_source_timeout
- Use psprintf() for cleaner code
- Fix indentation and remove trailing whitespace

Tests:
- Verify command receives replicas only (primary omitted)
- Verify host:port identifier format
- Test -1 handling for down nodes
- Test integer and float delay values
- Test validation, timeouts, and error handling
- Test wrong output counts and edge cases

Documentation:
- Document replication_delay_source_cmd with replica-only semantics
- Document -1 for down nodes
- Provide examples with correct output format
- Update replication_delay_source_timeout docs
---
 doc/src/sgml/stream-check.sgml                |  68 +++
 src/config/pool_config_variables.c            |  21 +
 src/include/pool_config.h                     |   3 +-
 src/sample/pgpool.conf.sample-stream          |  14 +
 src/streaming_replication/pool_worker_child.c | 336 ++++++++++++++-
 .../041.external_replication_delay/README     |  59 +++
 .../041.external_replication_delay/test.sh    | 404 ++++++++++++++++++
 .../test_parsing.sh                           |  54 +++
 .../test_validation.sh                        | 323 ++++++++++++++
 9 files changed, 1277 insertions(+), 5 deletions(-)
 create mode 100644 src/test/regression/tests/041.external_replication_delay/README
 create mode 100755 src/test/regression/tests/041.external_replication_delay/test.sh
 create mode 100755 src/test/regression/tests/041.external_replication_delay/test_parsing.sh
 create mode 100755 src/test/regression/tests/041.external_replication_delay/test_validation.sh

diff --git a/doc/src/sgml/stream-check.sgml b/doc/src/sgml/stream-check.sgml
index d2ca3ca49..fc4799080 100644
--- a/doc/src/sgml/stream-check.sgml
+++ b/doc/src/sgml/stream-check.sgml
@@ -309,6 +309,74 @@ GRANT pg_monitor TO sr_check_user;
     </listitem>
   </varlistentry>
 
+  <varlistentry id="guc-replication-delay-source-cmd" xreflabel="replication_delay_source_cmd">
+   <term><varname>replication_delay_source_cmd</varname> (<type>string</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_cmd</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies an external command to retrieve replication delay information for replica nodes.
+     When this parameter is set and not empty, <productname>Pgpool-II</productname> uses the
+     external command instead of built-in database queries to obtain replication delays.
+     The command is executed as the <productname>Pgpool-II</productname> process user.
+    </para>
+    <para>
+     The command receives replica node identifiers as positional arguments, with the primary
+     node omitted. Each identifier is in the format <literal>&lt;hostname&gt;:&lt;port&gt;</literal>,
+     for example <literal>server1:5432 server2:5432</literal>. The order matches
+     <productname>Pgpool-II</productname>'s backend order (excluding the primary), allowing the
+     script to correlate external metrics (such as from AWS CloudWatch for Aurora) to the correct nodes.
+    </para>
+    <para>
+     The command must write a single line to stdout containing one whitespace-separated delay value
+     per replica, in milliseconds, in the same order as the arguments. The primary node's delay is
+     implicitly zero and should not be included in the output. Delay values can be integers or
+     floating-point numbers.
+    </para>
+    <para>
+     Special value: <literal>-1</literal> indicates a replica that is down but not yet detected
+     by <productname>Pgpool-II</productname>'s health checks. <productname>Pgpool-II</productname>
+     will log this condition but rely on its own health-check logic to decide whether to trigger
+     failover; no failover is triggered solely by receiving <literal>-1</literal>.
+    </para>
+    <para>
+     Example for a 3-node cluster (1 primary + 2 replicas): if the command receives arguments
+     <literal>server1:5432 server2:5432</literal>, it should output <literal>"25.5 100"</literal>
+     to indicate the first replica has 25.5ms delay and the second has 100ms delay.
+    </para>
+    <para>
+     Default is empty (use built-in replication delay queries).
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry id="guc-replication-delay-source-timeout" xreflabel="replication_delay_source_timeout">
+   <term><varname>replication_delay_source_timeout</varname> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_timeout</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies the timeout in seconds for the external command specified by
+     <xref linkend="guc-replication-delay-source-cmd">.
+     If the command does not finish within the timeout, <productname>Pgpool-II</productname>
+     logs an error and continues using the built-in method.
+    </para>
+    <para>
+     Default is 10 seconds. Valid range is 1-3600 seconds.
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
   <varlistentry id="guc-log-standby-delay" xreflabel="log_standby_delay">
    <term><varname>log_standby_delay</varname> (<type>enum</type>)
     <indexterm>
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 62a05979a..a35d2200f 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -980,6 +980,16 @@ static struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_cmd", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"External command to retrieve replication delay information.",
+			CONFIG_VAR_TYPE_STRING, false, 0
+		},
+		&g_pool_config.replication_delay_source_cmd,
+		"",
+		NULL, NULL, NULL, NULL
+	},
+
 	{
 		{"failback_command", CFGCXT_RELOAD, FAILOVER_CONFIG,
 			"Command to execute when backend node is attached.",
@@ -2334,6 +2344,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_timeout", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Timeout for external replication delay command execution in seconds.",
+			CONFIG_VAR_TYPE_INT, false, 0
+		},
+		&g_pool_config.replication_delay_source_timeout,
+		10,
+		1, 3600,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	EMPTY_CONFIG_INT
 };
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 9160a31c8..5bc646805 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -86,7 +86,6 @@ typedef enum LogStandbyDelayModes
 	LSD_NONE
 } LogStandbyDelayModes;
 
-
 typedef enum MemCacheMethod
 {
 	SHMEM_CACHE = 1,
@@ -363,6 +362,8 @@ typedef struct
 	char	   *sr_check_password;	/* password for sr_check_user */
 	char	   *sr_check_database;	/* PostgreSQL database name for streaming
 									 * replication check */
+	char	   *replication_delay_source_cmd;	/* external command for replication delay */
+	int			replication_delay_source_timeout;	/* timeout for external command in seconds */
 	char	   *failover_command;	/* execute command when failover happens */
 	char	   *follow_primary_command; /* execute command when failover is
 										 * ended */
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index ba6b923b0..34462fd59 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -519,6 +519,20 @@ backend_clustering_mode = streaming_replication
 
 #sr_check_database = 'postgres'
                                    # Database name for streaming replication check
+
+#replication_delay_source_cmd = ''
+                                   # External command to retrieve replication delay information
+                                   # If set, pgpool uses this command instead of built-in queries
+                                   # Command receives replica node identifiers (host:port) as arguments
+                                   # Primary node is omitted from arguments
+                                   # Command should output one delay value (in ms) per replica
+                                   # Use -1 to indicate a replica that is down but not yet detected
+                                   # Format: "25 100" for 2 replicas (e.g., 3-node cluster with 1 primary)
+                                   # Command runs as the pgpool process user
+#replication_delay_source_timeout = 10
+                                   # Timeout for external command execution in seconds
+                                   # Range: 1-3600 seconds (default: 10)
+
 #delay_threshold = 0
                                    # Threshold before not dispatching query to standby node
                                    # Unit is in bytes
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 5bf19c37d..81dc82922 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -76,6 +76,8 @@ static volatile sig_atomic_t restart_request = 0;
 static void establish_persistent_connection(void);
 static void discard_persistent_connection(void);
 static void check_replication_time_lag(void);
+static void check_replication_time_lag_with_cmd(void);
+static char *build_instance_identifier_for_node(int node_id);
 static void CheckReplicationTimeLagErrorCb(void *arg);
 static unsigned long long int text_to_lsn(char *text);
 static RETSIGTYPE my_signal_handler(int sig);
@@ -259,11 +261,16 @@ do_worker_child(void *params)
 					POOL_NODE_STATUS *node_status;
 					int			i;
 
-					/* Do replication time lag checking */
-					check_replication_time_lag();
+			/* Do replication time lag checking */
+			/* Use external command if replication_delay_source_cmd is configured */
+			if (pool_config->replication_delay_source_cmd &&
+				strlen(pool_config->replication_delay_source_cmd) > 0)
+				check_replication_time_lag_with_cmd();
+			else
+				check_replication_time_lag();
 
-					/* Check node status */
-					node_status = verify_backend_node_status(slots);
+			/* Check node status */
+			node_status = verify_backend_node_status(slots);
 
 
 					for (i = 0; i < NUM_BACKENDS; i++)
@@ -659,6 +666,327 @@ check_replication_time_lag(void)
 	error_context_stack = callback.previous;
 }
 
+#define MAX_CMD_OUTPUT 4096
+#define MAX_REASONABLE_DELAY_MS 3600000.0  /* 1 hour in milliseconds */
+
+/* Global variable to track command timeout */
+static volatile sig_atomic_t command_timeout_occurred = 0;
+
+/*
+ * Signal handler for command timeou
+ */
+static void
+command_timeout_handler(int sig)
+{
+	command_timeout_occurred = 1;
+}
+
+
+
+/*
+ * Check replication time lag using external command
+ *
+ * The external command receives only replica (standby) node identifiers as arguments,
+ * omitting the primary node. It returns delay values in milliseconds for each replica.
+ * A value of -1 indicates a node that is down but not yet detected by pgpool's health checks.
+ */
+static void
+check_replication_time_lag_with_cmd(void)
+{
+	FILE		   *fp;
+	char		   *command;
+	char		   *line;
+	char		   *token;
+	char		   *saveptr;
+	int				replica_idx;
+	int				num_replicas;
+	double			delay_ms;
+	uint64			delay;
+	int				token_count = 0;
+	BackendInfo	   *bkinfo;
+	ErrorContextCallback callback;
+
+	if (NUM_BACKENDS <= 1)
+	{
+		/* If there's only one node, there's no point to do checking */
+		return;
+	}
+
+	if (REAL_PRIMARY_NODE_ID < 0)
+	{
+		/* No need to check if there's no primary */
+		return;
+	}
+
+	if (!VALID_BACKEND(REAL_PRIMARY_NODE_ID))
+	{
+		/* No need to check replication delay if primary is down */
+		return;
+	}
+
+	if (!pool_config->replication_delay_source_cmd ||
+		strlen(pool_config->replication_delay_source_cmd) == 0)
+	{
+		ereport(WARNING,
+				(errmsg("replication_delay_source_cmd is not configured"),
+				 errhint("Set replication_delay_source_cmd to use external command mode")));
+		/* Fall back to builtin method */
+		check_replication_time_lag();
+		return;
+	}
+
+	/* Allocate buffer for command output */
+	line = palloc(MAX_CMD_OUTPUT);
+	fp = NULL;
+
+	/*
+	 * Register a error context callback to throw proper context message
+	 */
+	callback.callback = CheckReplicationTimeLagErrorCb;
+	callback.arg = NULL;
+	callback.previous = error_context_stack;
+	error_context_stack = &callback;
+
+	/* Execute command as current process user */
+	PG_TRY();
+	{
+		const char *base_command = pool_config->replication_delay_source_cmd;
+		size_t total_len = strlen(base_command) + 1; /* +1 for NUL */
+
+		/* Build command with replica-only arguments (omit primary) */
+		/* Calculate total command length including space-separated replica identifiers */
+		for (int i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == REAL_PRIMARY_NODE_ID)
+				continue; /* Skip primary node */
+
+			char *ident = build_instance_identifier_for_node(i);
+			total_len += 1 /* space */ + strlen(ident);
+			pfree(ident);
+		}
+
+		command = palloc(total_len);
+		strlcpy(command, base_command, total_len);
+
+		/* Append replica identifiers */
+		for (int i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == REAL_PRIMARY_NODE_ID)
+				continue; /* Skip primary node */
+
+			char *ident = build_instance_identifier_for_node(i);
+			strlcat(command, " ", total_len);
+			strlcat(command, ident, total_len);
+			pfree(ident);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("executing replication delay command: %s", command)));
+
+		/* Set up timeout for command execution */
+		command_timeout_occurred = 0;
+		signal(SIGALRM, command_timeout_handler);
+		alarm(pool_config->replication_delay_source_timeout);
+
+		fp = popen(command, "r");
+		if (fp == NULL)
+		{
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+			ereport(ERROR,
+					(errmsg("failed to execute replication delay command: %s", command),
+					 errdetail("popen failed: %m")));
+		}
+
+		if (fgets(line, MAX_CMD_OUTPUT, fp) == NULL)
+		{
+			int pclose_result = pclose(fp);
+			fp = NULL;
+			alarm(0); /* Cancel alarm */
+			signal(SIGALRM, SIG_DFL);
+
+			if (command_timeout_occurred)
+			{
+				ereport(ERROR,
+						(errmsg("replication delay command timed out after %d seconds: %s",
+								pool_config->replication_delay_source_timeout, command),
+						 errhint("Consider increasing replication_delay_source_timeout or optimizing the command")));
+			}
+			else
+			{
+				ereport(ERROR,
+						(errmsg("failed to read output from replication delay command: %s", command),
+						 errdetail("command exit status: %d", pclose_result)));
+			}
+		}
+
+		alarm(0); /* Cancel alarm */
+		signal(SIGALRM, SIG_DFL);
+
+		/* Check if output was truncated */
+		if (strlen(line) == MAX_CMD_OUTPUT - 1 && line[MAX_CMD_OUTPUT - 2] != '\n')
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command output may have been truncated")));
+		}
+
+		pclose(fp);
+		fp = NULL;
+		pfree(command);
+		command = NULL;
+
+		/* Set primary node delay to 0 */
+		bkinfo = pool_get_node_info(REAL_PRIMARY_NODE_ID);
+		bkinfo->standby_delay = 0;
+		bkinfo->standby_delay_by_time = true;
+
+		/* Count expected replicas */
+		num_replicas = NUM_BACKENDS - 1; /* Total nodes minus primary */
+
+		/* Count tokens in output for validation */
+		char *line_copy = pstrdup(line);
+		char *temp_token = strtok(line_copy, " \t\n");
+		while (temp_token != NULL)
+		{
+			token_count++;
+			temp_token = strtok(NULL, " \t\n");
+		}
+		pfree(line_copy);
+
+		if (token_count != num_replicas)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
+							token_count, num_replicas),
+					 errhint("Command should output one delay value per replica node")));
+		}
+
+		/* Parse the output - one delay value per replica in order */
+		token = strtok_r(line, " \t\n", &saveptr);
+		replica_idx = 0;
+
+		for (int i = 0; i < NUM_BACKENDS && token != NULL; i++)
+		{
+			if (i == REAL_PRIMARY_NODE_ID)
+				continue; /* Skip primary - it's not in the output */
+
+			if (!VALID_BACKEND(i))
+			{
+				/* Skip invalid backend but consume token */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				replica_idx++;
+				continue;
+			}
+
+			char *endptr;
+			delay_ms = strtod(token, &endptr);
+
+			/* Validate the conversion */
+			if (*endptr != '\0')
+			{
+				ereport(WARNING,
+						(errmsg("invalid delay value '%s' for node %d, treating as 0",
+								token, i)));
+				delay_ms = 0;
+			}
+
+			bkinfo = pool_get_node_info(i);
+
+			/* Handle -1 for down nodes */
+			if (delay_ms == -1.0)
+			{
+				ereport(LOG,
+						(errmsg("node %d reported as down by external command (delay -1), relying on health check for failover decision",
+								i)));
+				/* Keep previous delay value, don't trigger failover */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				replica_idx++;
+				continue;
+			}
+
+			/* Validate delay value range */
+			if (delay_ms < 0)
+			{
+				ereport(WARNING,
+						(errmsg("negative delay value %.3f for node %d (other than -1), treating as 0",
+								delay_ms, i)));
+				delay_ms = 0;
+			}
+			else if (delay_ms > MAX_REASONABLE_DELAY_MS)
+			{
+				ereport(WARNING,
+						(errmsg("extremely large delay value %.3f for node %d",
+								delay_ms, i)));
+			}
+
+			/* Convert delay from milliseconds to microseconds for internal storage */
+			delay = (uint64)(delay_ms * 1000);
+			bkinfo->standby_delay = delay;
+			bkinfo->standby_delay_by_time = true;
+
+			/* Log delay if necessary */
+			uint64 delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in milliseconds, convert to microseconds */
+
+			if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
+				(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
+				 bkinfo->standby_delay > delay_threshold_by_time))
+			{
+				ereport(LOG,
+						(errmsg("Replication of node: %d is behind %.3f second(s) from the primary server (node: %d) [external command]",
+								i, delay_ms / 1000, REAL_PRIMARY_NODE_ID)));
+			}
+
+			token = strtok_r(NULL, " \t\n", &saveptr);
+			replica_idx++;
+		}
+
+	}
+	PG_CATCH();
+	{
+		/* Cleanup in case of error */
+		alarm(0); /* Cancel any pending alarm */
+		signal(SIGALRM, SIG_DFL);
+		if (fp)
+		{
+			pclose(fp);
+			fp = NULL;
+		}
+		if (line)
+			pfree(line);
+		if (command)
+			pfree(command);
+		error_context_stack = callback.previous;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Normal cleanup */
+	if (line)
+		pfree(line);
+
+	error_context_stack = callback.previous;
+}
+
+/*
+ * build_instance_identifier_for_node
+ *  Build an identifier string for a backend node for passing to external commands.
+ *  Format: "<hostname>:<port>"
+ */
+static char *
+build_instance_identifier_for_node(int node_id)
+{
+	BackendInfo *bi = pool_get_node_info(node_id);
+
+	if (!bi || bi->backend_hostname[0] == '\0' || bi->backend_port <= 0)
+	{
+		/* Fallback if hostname or port is not set */
+		return psprintf("unknown_node_%d", node_id);
+	}
+
+	/* Use hostname:port format */
+	return psprintf("%s:%d", bi->backend_hostname, bi->backend_port);
+}
+
 static void
 CheckReplicationTimeLagErrorCb(void *arg)
 {
diff --git a/src/test/regression/tests/041.external_replication_delay/README b/src/test/regression/tests/041.external_replication_delay/README
new file mode 100644
index 000000000..b4df5da40
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/README
@@ -0,0 +1,59 @@
+External Replication Delay Command Test
+========================================
+
+This test verifies the external command replication delay source feature.
+
+Test Coverage:
+- External command receives replica node identifiers only (primary omitted)
+- Instance identifiers in host:port format
+- Basic external command execution with integer and float millisecond values
+- Delay threshold functionality with external commands
+- Command execution as pgpool process user (no su wrapper)
+- Error handling for missing/invalid commands
+- Command execution failure scenarios
+- Command timeout handling with configurable timeout values
+- Input validation for invalid, negative (other than -1), and extremely large delay values
+- Handling of -1 for down nodes (logged but no immediate failover)
+- Wrong number of output values validation
+- Multiple -1 values (multiple down replicas)
+- Mixed scenarios (some replicas up, some down)
+- Output truncation detection
+
+Files:
+- test.sh: Main test script
+- test_parsing.sh: Unit test for parsing logic
+- test_validation.sh: Validation and edge case testing
+- README: This documentation
+
+Key Changes from Original Version:
+- Primary node is omitted from command arguments
+- Command receives only replica identifiers
+- Instance identifiers are in host:port format (not application_name)
+- Output format: one delay per replica (not per all nodes)
+- -1 value indicates down replica without triggering failover
+- Format example: "25 100" for 2 replicas (3-node cluster = 1 primary + 2 replicas)
+
+The test creates temporary command scripts that output delay values in the format:
+"replica1_delay replica2_delay ..."
+
+Where delays are in milliseconds and can be integer or floating-point values.
+Special value -1 indicates a replica that is down but not yet detected by pgpool.
+
+Test Environment:
+- Uses streaming replication mode with 3 nodes
+- Node 0 is primary (omitted from command arguments)
+- Nodes 1 and 2 are replicas (included in command arguments)
+- Configures sr_check_period = 1 second for faster testing
+- Tests various delay scenarios and threshold behaviors
+
+Expected Behavior:
+- External commands receive replica identifiers in host:port format
+- Primary node identifier is never passed to command
+- Command outputs one delay value per replica
+- -1 values are logged but don't trigger immediate failover
+- Delay values are parsed correctly (both int and float)
+- Threshold comparisons work properly
+- Error conditions are handled gracefully
+- Commands timeout appropriately based on configuration
+- Timeout errors provide helpful messages and hints
+- Tests are reliable with proper wait mechanisms instead of fixed sleeps
diff --git a/src/test/regression/tests/041.external_replication_delay/test.sh b/src/test/regression/tests/041.external_replication_delay/test.sh
new file mode 100755
index 000000000..e1dfbcecf
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test.sh
@@ -0,0 +1,404 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command replication delay source
+#
+source $TESTLIBS
+TESTDIR=testdir
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create external command scripts for testing
+# NOTE: Commands now only output delay values for REPLICAS (not primary)
+cat > delay_cmd_static.sh << 'EOF'
+#!/bin/bash
+# Static delay values for replicas: node1=25ms, node2=50ms (node0 is primary, not included)
+echo "25 50"
+EOF
+chmod +x delay_cmd_static.sh
+
+cat > delay_cmd_float.sh << 'EOF'
+#!/bin/bash
+# Float delay values for replicas: node1=25.5ms, node2=100.75ms
+echo "25.5 100.75"
+EOF
+chmod +x delay_cmd_float.sh
+
+cat > delay_cmd_high.sh << 'EOF'
+#!/bin/bash
+# High delay values to test threshold: node1=2000ms, node2=3000ms
+echo "2000 3000"
+EOF
+chmod +x delay_cmd_high.sh
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test0: External command receives replica identifiers only (primary omitted) ==="
+# ----------------------------------------------------------------------------------------
+# Command that captures its arguments and outputs valid delays for 2 replicas
+cat > delay_cmd_args.sh << 'EOF'
+#!/bin/bash
+printf "%s " "$@" > args.txt
+echo "25 50"
+EOF
+chmod +x delay_cmd_args.sh
+
+echo "replication_delay_source_cmd = './delay_cmd_args.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+# Reduce memory requirements for macOS shared memory limits
+echo "num_init_children = 4" >> etc/pgpool.conf
+echo "max_pool = 2" >> etc/pgpool.conf
+# Disable query caching to avoid shared memory issues on macOS
+echo "memory_cache_enabled = off" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+echo "Waiting for sr_check to pass args..."
+for i in {1..10}; do
+    if [ -f args.txt ]; then
+        break
+    fi
+    sleep 1
+done
+
+if [ ! -f args.txt ]; then
+    echo fail: did not capture command arguments
+    ./shutdownall
+    exit 1
+fi
+
+ARGS_CONTENT=$(cat args.txt | sed 's/[[:space:]]*$//')
+# Should receive 2 replica identifiers in host:port format (localhost:11003 localhost:11004 or server1:11003 server2:11004)
+# Primary (localhost:11002 or server0:11002) should be omitted
+if ! echo "$ARGS_CONTENT" | grep -qE "(server1|localhost):11003"; then
+    echo "fail: expected replica1:11003 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if ! echo "$ARGS_CONTENT" | grep -qE "(server2|localhost):11004"; then
+    echo "fail: expected replica2:11004 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if echo "$ARGS_CONTENT" | grep -qE "(server0|localhost):11002"; then
+    echo "fail: primary should not be in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: argument order correct - replicas only, primary omitted, host:port format
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test1: Basic external command with integer millisecond values ==="
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_static.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run and populate delay values
+# sr_check_period is 1 second, so wait a bit longer to ensure it runs
+echo "Waiting for sr_check to run..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that delay values are populated in the log
+grep "executing replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command was not executed
+    echo "Log contents:"
+    tail -20 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+# Verify actual delay values were parsed
+if ! $PSQL -t -c "SHOW POOL_NODES" test | grep -E "[0-9]+\.[0-9]+" >/dev/null; then
+    echo "Warning: No delay values found in POOL_NODES output"
+fi
+
+# Check for delay log messages
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command delay logging not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: basic external command test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test2: External command with floating-point millisecond values ==="
+# ----------------------------------------------------------------------------------------
+# Update configuration to use float command
+sed -i.bak "s|delay_cmd_static.sh|delay_cmd_float.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with float values
+echo "Waiting for sr_check with float values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log 2>/dev/null; then
+        echo "Float command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that float values are handled correctly
+grep "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: float command was not executed
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: floating-point values test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test3: External command with delay threshold ==="
+# ----------------------------------------------------------------------------------------
+# Update configuration to use high delay command and set threshold
+sed -i.bak "s|delay_cmd_float.sh|delay_cmd_high.sh|" etc/pgpool.conf
+echo "delay_threshold_by_time = 1000" >> etc/pgpool.conf
+echo "backend_weight0 = 0" >> etc/pgpool.conf  # Force queries to standby normally
+echo "backend_weight2 = 0" >> etc/pgpool.conf  # Only use node 1 as standby
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and detect high delays
+echo "Waiting for sr_check with high delay values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_high.sh" log/pgpool.log 2>/dev/null; then
+        echo "High delay command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SELECT * FROM t1 LIMIT 1;
+EOF
+
+# With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
+# Log format can vary: either "statement: SELECT..." or "SELECT... DB node id:"
+if ! grep -E "DB node id: 0.*statement: SELECT \* FROM t1 LIMIT 1" log/pgpool.log >/dev/null 2>&1 && \
+   ! grep -E "SELECT \* FROM t1 LIMIT 1.*DB node id: 0" log/pgpool.log >/dev/null 2>&1; then
+    echo fail: query was not sent to primary node despite high delay
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: delay threshold test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test4: External command execution as process user ==="
+# ----------------------------------------------------------------------------------------
+# Test that command runs as the current pgpool process user
+sed -i.bak "s|delay_cmd_high.sh|delay_cmd_static.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for sr_check to run as process user..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed as process user after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check that command was executed (without su wrapper)
+grep "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command was not executed as process user
+    ./shutdownall
+    exit 1
+fi
+
+# Verify no su command was used
+if grep -q "executing replication delay command.*su.*" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not use su wrapper
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: process user execution test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test5: Error handling - missing command ==="
+# ----------------------------------------------------------------------------------------
+# Test error handling when command is not configured
+sed -i.bak "s|replication_delay_source_cmd = './delay_cmd_static.sh'|replication_delay_source_cmd = ''|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# With empty command, should fall back to builtin method
+# No specific error message expected - just verify it doesn't crash
+sleep 3
+
+echo "ok: empty command test succeeded (fallback to builtin)"
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test6: Error handling - command execution failure ==="
+# ----------------------------------------------------------------------------------------
+# Test error handling when command fails
+echo "replication_delay_source_cmd = './nonexistent_command.sh'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with failing command
+echo "Waiting for sr_check with failing command..."
+for i in {1..5}; do
+    if grep -qE "failed to (execute|read output from) replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command failure detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about command execution failure
+# Accept multiple possible error messages depending on shell behavior
+if ! grep -qE "failed to (execute|read output from) replication delay command" log/pgpool.log 2>/dev/null; then
+    echo fail: command execution failure not detected
+    echo "Log contents:"
+    tail -50 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command failure test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test7: Command timeout handling ==="
+# ----------------------------------------------------------------------------------------
+# Create a command that takes longer than the timeout
+cat > delay_cmd_slow.sh << 'EOF'
+#!/bin/bash
+# Slow command that takes 15 seconds (longer than default 10s timeout)
+sleep 15
+echo "25 50"
+EOF
+chmod +x delay_cmd_slow.sh
+
+# Set a short timeout and use the slow command
+sed -i.bak "s|replication_delay_source_cmd = './nonexistent_command.sh'|replication_delay_source_cmd = './delay_cmd_slow.sh'|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 3" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and timeout
+echo "Waiting for command timeout..."
+for i in {1..15}; do
+    if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+        echo "Command timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout error message
+grep "replication delay command timed out after 3 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test8: Handling of -1 for down nodes ==="
+# ----------------------------------------------------------------------------------------
+# Create a command that returns -1 for one replica
+cat > delay_cmd_with_down_node.sh << 'EOF'
+#!/bin/bash
+# Return -1 for first replica (indicating it's down), normal value for second
+echo "-1 50"
+EOF
+chmod +x delay_cmd_with_down_node.sh
+
+# Reset config
+rm -f etc/pgpool.conf.bak
+sed -i.bak "s|delay_cmd_slow.sh|delay_cmd_with_down_node.sh|" etc/pgpool.conf
+sed -i.bak "s|replication_delay_source_timeout = 3|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to process -1 value
+echo "Waiting for sr_check to process -1 value..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command.*delay -1" log/pgpool.log 2>/dev/null; then
+        echo "-1 handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for -1 logging message
+grep "node.*reported as down by external command.*delay -1.*relying on health check" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 handling message not found
+    ./shutdownall
+    exit 1
+fi
+
+# Verify that pgpool didn't crash or trigger failover just from -1
+if grep -q "failover" log/pgpool.log 2>/dev/null; then
+    echo "fail: -1 should not trigger immediate failover"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: -1 handling test succeeded
+./shutdownall
+
+echo "All external replication delay tests passed!"
+exit 0
diff --git a/src/test/regression/tests/041.external_replication_delay/test_parsing.sh b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
new file mode 100755
index 000000000..82fdad144
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+#-------------------------------------------------------------------
+# Unit test for external command parsing logic
+# This tests the parsing without needing a full pgpool setup
+#
+
+echo "=== Testing external command output parsing ==="
+
+# Test 1: Integer values
+echo "Test 1: Integer millisecond values"
+echo "0 25 50" > test_output.txt
+echo "Expected: 0ms, 25ms, 50ms"
+echo "Output: $(cat test_output.txt)"
+echo ""
+
+# Test 2: Float values
+echo "Test 2: Floating-point millisecond values"
+echo "0 25.5 100.75" > test_output_float.txt
+echo "Expected: 0ms, 25.5ms, 100.75ms"
+echo "Output: $(cat test_output_float.txt)"
+echo ""
+
+# Test 3: High precision float values
+echo "Test 3: High precision values"
+echo "0 0.001 999.999" > test_output_precision.txt
+echo "Expected: 0ms, 0.001ms, 999.999ms"
+echo "Output: $(cat test_output_precision.txt)"
+echo ""
+
+# Test 4: Edge case - zero values
+echo "Test 4: All zero values"
+echo "0 0 0" > test_output_zeros.txt
+echo "Expected: 0ms, 0ms, 0ms"
+echo "Output: $(cat test_output_zeros.txt)"
+echo ""
+
+# Test 5: Edge case - large values
+echo "Test 5: Large delay values"
+echo "0 5000 10000" > test_output_large.txt
+echo "Expected: 0ms, 5000ms, 10000ms"
+echo "Output: $(cat test_output_large.txt)"
+echo ""
+
+# Test 6: Mixed integer and float values
+echo "Test 6: Mixed integer and float values"
+echo "0 25 50.5" > test_output_mixed.txt
+echo "Expected: 0ms, 25ms, 50.5ms"
+echo "Output: $(cat test_output_mixed.txt)"
+echo ""
+
+# Cleanup
+rm -f test_output_*.txt
+
+echo "All parsing tests completed. These outputs should be parseable by the external command feature."
diff --git a/src/test/regression/tests/041.external_replication_delay/test_validation.sh b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
new file mode 100755
index 000000000..2cd4a7f0b
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
@@ -0,0 +1,323 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command validation and edge cases
+#
+source $TESTLIBS
+TESTDIR=testdir_validation
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create test command scripts
+# NOTE: All commands output values for REPLICAS only (primary omitted)
+cat > delay_cmd_validation.sh << 'EOF'
+#!/bin/bash
+# Test validation: output with invalid values for 2 replicas
+echo "invalid_value 50.5"
+EOF
+chmod +x delay_cmd_validation.sh
+
+cat > delay_cmd_negative.sh << 'EOF'
+#!/bin/bash
+# Test negative values (other than -1)
+echo "-25 50"
+EOF
+chmod +x delay_cmd_negative.sh
+
+cat > delay_cmd_large.sh << 'EOF'
+#!/bin/bash
+# Test extremely large values
+echo "9999999 50"
+EOF
+chmod +x delay_cmd_large.sh
+
+cat > delay_cmd_wrong_count.sh << 'EOF'
+#!/bin/bash
+# Test wrong number of values (only 1 instead of 2 for 2 replicas)
+echo "25"
+EOF
+chmod +x delay_cmd_wrong_count.sh
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test1: Validation of invalid delay values ==="
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_validation.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+# Reduce memory requirements for macOS shared memory limits
+echo "num_init_children = 4" >> etc/pgpool.conf
+echo "max_pool = 2" >> etc/pgpool.conf
+# Disable query caching to avoid shared memory issues on macOS
+echo "memory_cache_enabled = off" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run
+echo "Waiting for validation test..."
+for i in {1..10}; do
+    if grep -q "invalid delay value" log/pgpool.log 2>/dev/null; then
+        echo "Validation error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for validation warning
+grep "invalid delay value 'invalid_value' for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: validation warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: invalid value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test2: Negative delay values (other than -1) ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_validation.sh|delay_cmd_negative.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for negative value test..."
+for i in {1..10}; do
+    if grep -q "negative delay value.*other than -1" log/pgpool.log 2>/dev/null; then
+        echo "Negative value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for negative value warning
+grep "negative delay value.*other than -1.*treating as 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: negative value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: negative value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test3: Extremely large delay values ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_negative.sh|delay_cmd_large.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for large value test..."
+for i in {1..10}; do
+    if grep -q "extremely large delay value" log/pgpool.log 2>/dev/null; then
+        echo "Large value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for large value warning
+grep "extremely large delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: large value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: large value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test4: Wrong number of output values ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_large.sh|delay_cmd_wrong_count.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for wrong count test..."
+for i in {1..10}; do
+    if grep -q "returned.*values, expected.*replica" log/pgpool.log 2>/dev/null; then
+        echo "Wrong count warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for wrong count warning
+grep "returned.*values, expected.*replica.*Command should output one delay value per replica" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: wrong count validation test not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: wrong count validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test5: Multiple -1 values ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_multi_down.sh << 'EOF'
+#!/bin/bash
+# Test multiple replicas down
+echo "-1 -1"
+EOF
+chmod +x delay_cmd_multi_down.sh
+
+sed -i.bak "s|delay_cmd_wrong_count.sh|delay_cmd_multi_down.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for multi-down test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Multiple down nodes detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for multiple -1 handling
+DOWN_COUNT=$(grep -c "node.*reported as down by external command.*delay -1" log/pgpool.log)
+if [ "$DOWN_COUNT" -lt 2 ]; then
+    echo fail: expected 2 down node messages, found $DOWN_COUNT
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: multiple -1 handling test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test6: Command timeout with different timeout values ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_timeout.sh << 'EOF'
+#!/bin/bash
+# Command that takes 5 seconds
+sleep 5
+echo "25 50"
+EOF
+chmod +x delay_cmd_timeout.sh
+
+# Test with timeout shorter than command duration
+sed -i.bak "s|delay_cmd_multi_down.sh|delay_cmd_timeout.sh|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 2" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for timeout
+echo "Waiting for timeout test (2s timeout, 5s command)..."
+for i in {1..10}; do
+    if grep -q "replication delay command timed out after 2 seconds" log/pgpool.log 2>/dev/null; then
+        echo "Timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout message
+grep "replication delay command timed out after 2 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: timeout test succeeded
+./shutdownall
+
+# Test with timeout longer than command duration
+sed -i.bak "s|replication_delay_source_timeout = 2|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for successful execution
+echo "Waiting for successful execution (10s timeout, 5s command)..."
+for i in {1..15}; do
+    if grep -q "executing replication delay command.*delay_cmd_timeout.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed successfully after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should not timeout this time
+if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not have timed out with 10s timeout
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: extended timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test7: Mix of valid delays and -1 ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_mixed.sh << 'EOF'
+#!/bin/bash
+# One replica up (25ms), one down (-1)
+echo "25 -1"
+EOF
+chmod +x delay_cmd_mixed.sh
+
+sed -i.bak "s|delay_cmd_timeout.sh|delay_cmd_mixed.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check
+echo "Waiting for mixed delay test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Mixed delay handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should log one -1 and process one normal delay
+grep "node.*reported as down by external command.*delay -1" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 not logged
+    ./shutdownall
+    exit 1
+fi
+
+# Should also log the normal replica delay
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo "Note: Normal replica delay logging may not be visible with log_standby_delay settings"
+fi
+
+echo ok: mixed delay handling test succeeded
+./shutdownall
+
+echo "All validation tests passed!"
+exit 0
\ No newline at end of file
-- 
2.51.2



^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-11-06 09:24  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-11-06 09:24 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> Sorry for that - thanks for the patch.
> 
> Please find attached a new version

Thanks for the new version. Unfortunately this time regression test
fails at:

> Waiting for command timeout...
> fail: command timeout not detected

Attached is the pgpool.log.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> > thanks and sorry for the issues, please find attached updated version.
>>
>> No problem.
>>
>> This time the patch applies fine, no compiler warnings.  However,
>> regression test did not passed here (on Ubuntu 24 LTS if this
>> matters).  So I looked into
>> src/test/regression/tests/041.external_replication_delay/test.sh a
>> little bit and apply attached patch (test.sh.patch). It moved forward
>> partially but failed at:
>>
>> fail: command execution failure not detected
>>
>> Please find attached
>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>> and src/test/regression/log/041.external_replication_delay.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO

2025-11-06 18:19:45.730: main pid 87757: DEBUG:  initializing pool configuration
2025-11-06 18:19:45.730: main pid 87757: DETAIL:  num_backends: 3 total_weight: 1.000000
2025-11-06 18:19:45.730: main pid 87757: DEBUG:  initializing pool configuration
2025-11-06 18:19:45.730: main pid 87757: DETAIL:  backend 0 weight: 0.000000 flag: 0000
2025-11-06 18:19:45.730: main pid 87757: DEBUG:  initializing pool configuration
2025-11-06 18:19:45.730: main pid 87757: DETAIL:  backend 1 weight: 2147483647.000000 flag: 0000
2025-11-06 18:19:45.730: main pid 87757: DEBUG:  initializing pool configuration
2025-11-06 18:19:45.730: main pid 87757: DETAIL:  backend 2 weight: 0.000000 flag: 0000
2025-11-06 18:19:45.737: main pid 87757: LOG:  Backend status file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/pgpool_status discarded
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  BackendDesc: 113672 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  pool_coninfo_size: num_init_children (4) * max_pool (2) * MAX_NUM_BACKENDS (128) * sizeof(ConnectionInfo) (416) = 425984 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  ProcessInfo: num_init_children (4) * sizeof(ProcessInfo) (2152) = 8608 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  UserSignalSlot: 24 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  POOL_REQUEST_INFO: 5272 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  stat_shared_memory_size: 9216 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  SI_ManageInfo: 24 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  memcache blocks :64
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  shared_memory_cache_size: 67108864
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  shared_memory_fsmm_size: 64
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  pool_hash_size: 67108880
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  POOL_QUERY_CACHE_STATS: 24 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: LOG:  allocating shared memory segment of size: 134793216 
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  pool_coninfo_size: num_init_children (4) * max_pool (2) * MAX_NUM_BACKENDS (128) * sizeof(ConnectionInfo) (416) = 425984 bytes requested for shared memory
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  memcache blocks :64
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  shared_memory_cache_size: 67108864
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  memory cache request size : 67108864
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  shared_memory_fsmm_size: 64
2025-11-06 18:19:45.805: main pid 87757: LOG:  pool_discard_oid_maps: discarded memqcache oid maps
2025-11-06 18:19:45.812: main pid 87757: LOG:  create socket files[0]: /tmp/.s.PGSQL.11000
2025-11-06 18:19:45.812: main pid 87757: LOG:  listen address[0]: *
2025-11-06 18:19:45.813: main pid 87757: LOG:  Setting up socket for 0.0.0.0:11000
2025-11-06 18:19:45.813: main pid 87757: LOG:  Setting up socket for :::11000
2025-11-06 18:19:45.813: main pid 87757: DEBUG:  Spawning 4 child processes
2025-11-06 18:19:45.813: child pid 87764: DEBUG:  initializing backend status
2025-11-06 18:19:45.813: child pid 87765: DEBUG:  initializing backend status
2025-11-06 18:19:45.814: main pid 87757: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2025-11-06 18:19:45.814: child pid 87766: DEBUG:  initializing backend status
2025-11-06 18:19:45.814: child pid 87767: DEBUG:  initializing backend status
2025-11-06 18:19:45.819: main pid 87757: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.819: main pid 87757: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.819: main pid 87757: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.823: main pid 87757: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.823: main pid 87757: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.823: main pid 87757: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  pool_acquire_follow_primary_lock: lock was not held by anyone
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  pool_acquire_follow_primary_lock: succeeded in acquiring lock
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  get_server_version: backend 0 server version: 180000
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  get_server_version: backend 1 server version: 180000
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  get_server_version: backend 2 server version: 180000
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  verify_backend_node_status: multiple standbys: 2
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  verify_backend_node_status: detach_false_primary is off and no additional checking is performed
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  pool_release_follow_primary_lock called
2025-11-06 18:19:45.827: main pid 87757: LOG:  find_primary_node: primary node is 0
2025-11-06 18:19:45.827: main pid 87757: LOG:  find_primary_node: standby node is 1
2025-11-06 18:19:45.827: main pid 87757: LOG:  find_primary_node: standby node is 2
2025-11-06 18:19:45.827: main pid 87757: LOG:  create socket files[0]: /tmp/.s.PGSQL.11001
2025-11-06 18:19:45.827: main pid 87757: LOG:  listen address[0]: localhost
2025-11-06 18:19:45.827: main pid 87757: LOG:  Setting up socket for 127.0.0.1:11001
2025-11-06 18:19:45.827: pcp_main pid 87771: DEBUG:  I am PCP child with pid:87771
2025-11-06 18:19:45.828: pcp_main pid 87771: LOG:  PCP process: 87771 started
2025-11-06 18:19:45.828: sr_check_worker pid 87772: LOG:  process started
2025-11-06 18:19:45.828: sr_check_worker pid 87772: DEBUG:  I am 87772
2025-11-06 18:19:45.828: sr_check_worker pid 87772: DEBUG:  pool_acquire_follow_primary_lock: lock was not held by anyone
2025-11-06 18:19:45.828: sr_check_worker pid 87772: DEBUG:  pool_acquire_follow_primary_lock: succeeded in acquiring lock
2025-11-06 18:19:45.828: health_check pid 87773: LOG:  process started
2025-11-06 18:19:45.828: health_check0 pid 87773: DEBUG:  I am health check process pid:87773 DB node id:0
2025-11-06 18:19:45.828: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.828: health_check pid 87774: LOG:  process started
2025-11-06 18:19:45.828: health_check1 pid 87774: DEBUG:  I am health check process pid:87774 DB node id:1
2025-11-06 18:19:45.828: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.828: health_check pid 87775: LOG:  process started
2025-11-06 18:19:45.828: health_check2 pid 87775: DEBUG:  I am health check process pid:87775 DB node id:2
2025-11-06 18:19:45.828: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.829: health_check1 pid 87774: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.829: health_check1 pid 87774: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.829: health_check1 pid 87774: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.829: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.829: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.829: health_check1 pid 87774: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:45.829: health_check1 pid 87774: DETAIL:  No such file or directory
2025-11-06 18:19:45.830: health_check0 pid 87773: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.830: health_check0 pid 87773: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.830: health_check0 pid 87773: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.830: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.830: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.830: health_check0 pid 87773: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:45.830: health_check0 pid 87773: DETAIL:  No such file or directory
2025-11-06 18:19:45.830: sr_check_worker pid 87772: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.830: health_check2 pid 87775: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.830: sr_check_worker pid 87772: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.830: health_check2 pid 87775: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.830: sr_check_worker pid 87772: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.830: health_check2 pid 87775: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.830: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.830: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.830: health_check2 pid 87775: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:45.830: health_check2 pid 87775: DETAIL:  No such file or directory
2025-11-06 18:19:45.831: sr_check_worker pid 87772: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.831: sr_check_worker pid 87772: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.831: sr_check_worker pid 87772: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.832: sr_check_worker pid 87772: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.832: sr_check_worker pid 87772: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.832: sr_check_worker pid 87772: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.832: sr_check_worker pid 87772: DEBUG:  executing replication delay command: ./delay_cmd_slow.sh localhost:11003 localhost:11004
2025-11-06 18:19:45.832: sr_check_worker pid 87772: CONTEXT:  while checking replication time lag
2025-11-06 18:19:45.834: main pid 87757: LOG:  pgpool-II successfully started. version 4.7devel (tasukiboshi)
2025-11-06 18:19:45.835: main pid 87757: LOG:  node status[0]: 1
2025-11-06 18:19:45.835: main pid 87757: LOG:  node status[1]: 2
2025-11-06 18:19:45.835: main pid 87757: LOG:  node status[2]: 2
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  I am 87766 accept fd 7
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  reading startup packet
2025-11-06 18:19:46.735: child pid 87766: DETAIL:  application_name: psql
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  reading startup packet
2025-11-06 18:19:46.735: child pid 87766: DETAIL:  Protocol Major: 3 Minor: 0 database: test user: t-ishii
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  creating new connection to backend
2025-11-06 18:19:46.735: child pid 87766: DETAIL:  connecting 0 backend
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  creating new connection to backend
2025-11-06 18:19:46.735: child pid 87766: DETAIL:  connecting 1 backend
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  creating new connection to backend
2025-11-06 18:19:46.735: child pid 87766: DETAIL:  connecting 2 backend
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  authentication backend
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  auth kind:0
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  reading message length
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  message length (22) in slot 1 does not match with slot 0(23)
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  reading message length
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  message length (22) in slot 2 does not match with slot 0(23)
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:0 name:"in_hot_standby" value:"off"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:1 name:"in_hot_standby" value:"on"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:2 name:"in_hot_standby" value:"on"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:0 name:"integer_datetimes" value:"on"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:1 name:"integer_datetimes" value:"on"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:2 name:"integer_datetimes" value:"on"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:0 name:"TimeZone" value:"Asia/Tokyo"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:1 name:"TimeZone" value:"Asia/Tokyo"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:2 name:"TimeZone" value:"Asia/Tokyo"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:0 name:"IntervalStyle" value:"postgres"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:1 name:"IntervalStyle" value:"postgres"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:2 name:"IntervalStyle" value:"postgres"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:0 name:"search_path" value:""$user", public"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:1 name:"search_path" value:""$user", public"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:2 name:"search_path" value:""$user", public"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:0 name:"is_superuser" value:"on"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:1 name:"is_superuser" value:"on"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:2 name:"is_superuser" value:"on"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:0 name:"application_name" value:"psql"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"application_name" value:"psql"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"application_name" value:"psql"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"default_transaction_read_only" value:"off"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"default_transaction_read_only" value:"off"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"default_transaction_read_only" value:"off"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"scram_iterations" value:"4096"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"scram_iterations" value:"4096"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"scram_iterations" value:"4096"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"DateStyle" value:"ISO, MDY"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"DateStyle" value:"ISO, MDY"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"DateStyle" value:"ISO, MDY"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"standard_conforming_strings" value:"on"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"standard_conforming_strings" value:"on"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"standard_conforming_strings" value:"on"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"session_authorization" value:"t-ishii"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"session_authorization" value:"t-ishii"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"session_authorization" value:"t-ishii"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"client_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"client_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"client_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"server_version" value:"18.0"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"server_version" value:"18.0"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"server_version" value:"18.0"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"server_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"server_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"server_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  cancel key length: 4
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  authentication backend
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  cp->info[i]:0x726410677c08 pid:87787
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  authentication backend
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  cp->info[i]:0x726410677da8 pid:87786
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  authentication backend
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  cp->info[i]:0x726410677f48 pid:87788
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  sending backend key data
2025-11-06 18:19:46.755: psql pid 87766: DEBUG:  selecting load balance node
2025-11-06 18:19:46.755: psql pid 87766: DETAIL:  selected backend id is 1
2025-11-06 18:19:46.756: psql pid 87766: LOG:  DB node id: 0 backend pid: 87787 statement: SELECT pg_catalog.version()
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  username: "t-ishii" database_name: "test"
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  query: "SELECT pg_catalog.version()"
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  `t-ishiiSELECT pg_catalog.version()test' -> `c8645f9bdf015b6b5ee4667cb578f1b3'
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  fetching from cache storage
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  search key "c8645f9bdf015b6b5ee4667cb578f1b3"
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  fetching from cache storage
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  cache not found on shared memory
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  not hit local relation cache and query cache
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  query:SELECT pg_catalog.version()
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.version()"
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.757: psql pid 87766: DEBUG:  committing relation cache to cache storage
2025-11-06 18:19:46.757: psql pid 87766: DETAIL:  Query="SELECT pg_catalog.version()"
2025-11-06 18:19:46.757: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.757: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.757: psql pid 87766: DETAIL:  username: "t-ishii" database_name: "test"
2025-11-06 18:19:46.757: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.757: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.757: psql pid 87766: DETAIL:  query: "SELECT pg_catalog.version()"
2025-11-06 18:19:46.757: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.757: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.757: psql pid 87766: DETAIL:  `t-ishiiSELECT pg_catalog.version()test' -> `c8645f9bdf015b6b5ee4667cb578f1b3'
2025-11-06 18:19:46.757: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.757: psql pid 87766: DEBUG:  committing relation cache to cache storage
2025-11-06 18:19:46.757: psql pid 87766: DETAIL:  memqcache_expire = 0
2025-11-06 18:19:46.757: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.758: psql pid 87766: DEBUG:  memcache adding item
2025-11-06 18:19:46.758: psql pid 87766: DETAIL:  new item inserted. blockid: 0 itemid:0
2025-11-06 18:19:46.758: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.758: psql pid 87766: DEBUG:  memcache adding item
2025-11-06 18:19:46.758: psql pid 87766: DETAIL:  block: 0 item: 0
2025-11-06 18:19:46.758: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.758: psql pid 87766: DEBUG:  SimpleQuery
2025-11-06 18:19:46.758: psql pid 87766: DETAIL:  nodes reporting
2025-11-06 18:19:46.761: psql pid 87766: DEBUG:  authenticate kind = 0
2025-11-06 18:19:46.761: psql pid 87766: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:46.761: psql pid 87766: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:46.761: psql pid 87766: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:46.765: psql pid 87766: DEBUG:  authenticate kind = 0
2025-11-06 18:19:46.765: psql pid 87766: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:46.765: psql pid 87766: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:46.765: psql pid 87766: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  authenticate kind = 0
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  decide where to send the query
2025-11-06 18:19:46.768: psql pid 87766: DETAIL:  destination = 3 for query= "DISCARD ALL"
2025-11-06 18:19:46.768: psql pid 87766: LOG:  DB node id: 0 backend pid: 87787 statement: DISCARD ALL
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  waiting for query response
2025-11-06 18:19:46.768: psql pid 87766: DETAIL:  waiting for backend:0 to complete the query
2025-11-06 18:19:46.768: psql pid 87766: LOG:  DB node id: 1 backend pid: 87786 statement: DISCARD ALL
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  waiting for query response
2025-11-06 18:19:46.768: psql pid 87766: DETAIL:  waiting for backend:1 to complete the query
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  setting backend connection close timer
2025-11-06 18:19:46.768: psql pid 87766: DETAIL:  close time 1762420786
2025-11-06 18:19:55.830: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.830: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.830: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.831: health_check1 pid 87774: DEBUG:  authenticate kind = 0
2025-11-06 18:19:55.831: health_check1 pid 87774: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:55.831: health_check1 pid 87774: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:55.831: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.831: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.831: health_check1 pid 87774: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:55.831: health_check1 pid 87774: DETAIL:  No such file or directory
2025-11-06 18:19:55.831: health_check0 pid 87773: DEBUG:  authenticate kind = 0
2025-11-06 18:19:55.831: health_check0 pid 87773: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:55.831: health_check0 pid 87773: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:55.831: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.831: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.832: health_check0 pid 87773: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:55.832: health_check0 pid 87773: DETAIL:  No such file or directory
2025-11-06 18:19:55.832: health_check2 pid 87775: DEBUG:  authenticate kind = 0
2025-11-06 18:19:55.832: health_check2 pid 87775: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:55.832: health_check2 pid 87775: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:55.832: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.832: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.832: health_check2 pid 87775: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:55.832: health_check2 pid 87775: DETAIL:  No such file or directory
2025-11-06 18:20:00.835: sr_check_worker pid 87772: LOG:  Replication of node: 1 is behind 0.025 second(s) from the primary server (node: 0) [external command]
2025-11-06 18:20:00.835: sr_check_worker pid 87772: CONTEXT:  while checking replication time lag
2025-11-06 18:20:00.835: sr_check_worker pid 87772: LOG:  Replication of node: 2 is behind 0.050 second(s) from the primary server (node: 0) [external command]
2025-11-06 18:20:00.835: sr_check_worker pid 87772: CONTEXT:  while checking replication time lag
2025-11-06 18:20:00.835: sr_check_worker pid 87772: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:20:00.835: sr_check_worker pid 87772: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  verify_backend_node_status: multiple standbys: 2
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  verify_backend_node_status: detach_false_primary is off and no additional checking is performed
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  node status[0]: 1
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  node status[1]: 2
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  node status[2]: 2
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  pool_release_follow_primary_lock called
2025-11-06 18:20:01.810: main pid 87757: LOG:  exit handler called (signal: 2)
2025-11-06 18:20:01.810: main pid 87757: LOG:  shutting down by signal 2
2025-11-06 18:20:01.810: main pid 87757: LOG:  terminating all child processes
2025-11-06 18:20:01.813: main pid 87757: LOG:  Pgpool-II system is shutdown


Attachments:

  [text/plain] pgpool.log (31.6K, 2-pgpool.log)
  download | inline:
2025-11-06 18:19:45.730: main pid 87757: DEBUG:  initializing pool configuration
2025-11-06 18:19:45.730: main pid 87757: DETAIL:  num_backends: 3 total_weight: 1.000000
2025-11-06 18:19:45.730: main pid 87757: DEBUG:  initializing pool configuration
2025-11-06 18:19:45.730: main pid 87757: DETAIL:  backend 0 weight: 0.000000 flag: 0000
2025-11-06 18:19:45.730: main pid 87757: DEBUG:  initializing pool configuration
2025-11-06 18:19:45.730: main pid 87757: DETAIL:  backend 1 weight: 2147483647.000000 flag: 0000
2025-11-06 18:19:45.730: main pid 87757: DEBUG:  initializing pool configuration
2025-11-06 18:19:45.730: main pid 87757: DETAIL:  backend 2 weight: 0.000000 flag: 0000
2025-11-06 18:19:45.737: main pid 87757: LOG:  Backend status file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/pgpool_status discarded
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  BackendDesc: 113672 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  pool_coninfo_size: num_init_children (4) * max_pool (2) * MAX_NUM_BACKENDS (128) * sizeof(ConnectionInfo) (416) = 425984 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  ProcessInfo: num_init_children (4) * sizeof(ProcessInfo) (2152) = 8608 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  UserSignalSlot: 24 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  POOL_REQUEST_INFO: 5272 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  stat_shared_memory_size: 9216 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  SI_ManageInfo: 24 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  memcache blocks :64
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  shared_memory_cache_size: 67108864
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  shared_memory_fsmm_size: 64
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  pool_hash_size: 67108880
2025-11-06 18:19:45.737: main pid 87757: DEBUG:  POOL_QUERY_CACHE_STATS: 24 bytes requested for shared memory
2025-11-06 18:19:45.737: main pid 87757: LOG:  allocating shared memory segment of size: 134793216 
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  pool_coninfo_size: num_init_children (4) * max_pool (2) * MAX_NUM_BACKENDS (128) * sizeof(ConnectionInfo) (416) = 425984 bytes requested for shared memory
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  memcache blocks :64
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  shared_memory_cache_size: 67108864
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  memory cache request size : 67108864
2025-11-06 18:19:45.803: main pid 87757: DEBUG:  shared_memory_fsmm_size: 64
2025-11-06 18:19:45.805: main pid 87757: LOG:  pool_discard_oid_maps: discarded memqcache oid maps
2025-11-06 18:19:45.812: main pid 87757: LOG:  create socket files[0]: /tmp/.s.PGSQL.11000
2025-11-06 18:19:45.812: main pid 87757: LOG:  listen address[0]: *
2025-11-06 18:19:45.813: main pid 87757: LOG:  Setting up socket for 0.0.0.0:11000
2025-11-06 18:19:45.813: main pid 87757: LOG:  Setting up socket for :::11000
2025-11-06 18:19:45.813: main pid 87757: DEBUG:  Spawning 4 child processes
2025-11-06 18:19:45.813: child pid 87764: DEBUG:  initializing backend status
2025-11-06 18:19:45.813: child pid 87765: DEBUG:  initializing backend status
2025-11-06 18:19:45.814: main pid 87757: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2025-11-06 18:19:45.814: child pid 87766: DEBUG:  initializing backend status
2025-11-06 18:19:45.814: child pid 87767: DEBUG:  initializing backend status
2025-11-06 18:19:45.819: main pid 87757: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.819: main pid 87757: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.819: main pid 87757: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.823: main pid 87757: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.823: main pid 87757: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.823: main pid 87757: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  pool_acquire_follow_primary_lock: lock was not held by anyone
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  pool_acquire_follow_primary_lock: succeeded in acquiring lock
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  get_server_version: backend 0 server version: 180000
2025-11-06 18:19:45.826: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  get_server_version: backend 1 server version: 180000
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  get_server_version: backend 2 server version: 180000
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  verify_backend_node_status: multiple standbys: 2
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  verify_backend_node_status: detach_false_primary is off and no additional checking is performed
2025-11-06 18:19:45.827: main pid 87757: DEBUG:  pool_release_follow_primary_lock called
2025-11-06 18:19:45.827: main pid 87757: LOG:  find_primary_node: primary node is 0
2025-11-06 18:19:45.827: main pid 87757: LOG:  find_primary_node: standby node is 1
2025-11-06 18:19:45.827: main pid 87757: LOG:  find_primary_node: standby node is 2
2025-11-06 18:19:45.827: main pid 87757: LOG:  create socket files[0]: /tmp/.s.PGSQL.11001
2025-11-06 18:19:45.827: main pid 87757: LOG:  listen address[0]: localhost
2025-11-06 18:19:45.827: main pid 87757: LOG:  Setting up socket for 127.0.0.1:11001
2025-11-06 18:19:45.827: pcp_main pid 87771: DEBUG:  I am PCP child with pid:87771
2025-11-06 18:19:45.828: pcp_main pid 87771: LOG:  PCP process: 87771 started
2025-11-06 18:19:45.828: sr_check_worker pid 87772: LOG:  process started
2025-11-06 18:19:45.828: sr_check_worker pid 87772: DEBUG:  I am 87772
2025-11-06 18:19:45.828: sr_check_worker pid 87772: DEBUG:  pool_acquire_follow_primary_lock: lock was not held by anyone
2025-11-06 18:19:45.828: sr_check_worker pid 87772: DEBUG:  pool_acquire_follow_primary_lock: succeeded in acquiring lock
2025-11-06 18:19:45.828: health_check pid 87773: LOG:  process started
2025-11-06 18:19:45.828: health_check0 pid 87773: DEBUG:  I am health check process pid:87773 DB node id:0
2025-11-06 18:19:45.828: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.828: health_check pid 87774: LOG:  process started
2025-11-06 18:19:45.828: health_check1 pid 87774: DEBUG:  I am health check process pid:87774 DB node id:1
2025-11-06 18:19:45.828: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.828: health_check pid 87775: LOG:  process started
2025-11-06 18:19:45.828: health_check2 pid 87775: DEBUG:  I am health check process pid:87775 DB node id:2
2025-11-06 18:19:45.828: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.829: health_check1 pid 87774: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.829: health_check1 pid 87774: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.829: health_check1 pid 87774: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.829: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.829: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.829: health_check1 pid 87774: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:45.829: health_check1 pid 87774: DETAIL:  No such file or directory
2025-11-06 18:19:45.830: health_check0 pid 87773: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.830: health_check0 pid 87773: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.830: health_check0 pid 87773: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.830: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.830: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.830: health_check0 pid 87773: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:45.830: health_check0 pid 87773: DETAIL:  No such file or directory
2025-11-06 18:19:45.830: sr_check_worker pid 87772: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.830: health_check2 pid 87775: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.830: sr_check_worker pid 87772: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.830: health_check2 pid 87775: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.830: sr_check_worker pid 87772: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.830: health_check2 pid 87775: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.830: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.830: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:45.830: health_check2 pid 87775: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:45.830: health_check2 pid 87775: DETAIL:  No such file or directory
2025-11-06 18:19:45.831: sr_check_worker pid 87772: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.831: sr_check_worker pid 87772: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.831: sr_check_worker pid 87772: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.832: sr_check_worker pid 87772: DEBUG:  authenticate kind = 0
2025-11-06 18:19:45.832: sr_check_worker pid 87772: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:45.832: sr_check_worker pid 87772: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:45.832: sr_check_worker pid 87772: DEBUG:  executing replication delay command: ./delay_cmd_slow.sh localhost:11003 localhost:11004
2025-11-06 18:19:45.832: sr_check_worker pid 87772: CONTEXT:  while checking replication time lag
2025-11-06 18:19:45.834: main pid 87757: LOG:  pgpool-II successfully started. version 4.7devel (tasukiboshi)
2025-11-06 18:19:45.835: main pid 87757: LOG:  node status[0]: 1
2025-11-06 18:19:45.835: main pid 87757: LOG:  node status[1]: 2
2025-11-06 18:19:45.835: main pid 87757: LOG:  node status[2]: 2
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  I am 87766 accept fd 7
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  reading startup packet
2025-11-06 18:19:46.735: child pid 87766: DETAIL:  application_name: psql
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  reading startup packet
2025-11-06 18:19:46.735: child pid 87766: DETAIL:  Protocol Major: 3 Minor: 0 database: test user: t-ishii
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  creating new connection to backend
2025-11-06 18:19:46.735: child pid 87766: DETAIL:  connecting 0 backend
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  creating new connection to backend
2025-11-06 18:19:46.735: child pid 87766: DETAIL:  connecting 1 backend
2025-11-06 18:19:46.735: child pid 87766: DEBUG:  creating new connection to backend
2025-11-06 18:19:46.735: child pid 87766: DETAIL:  connecting 2 backend
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  authentication backend
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  auth kind:0
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  reading message length
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  message length (22) in slot 1 does not match with slot 0(23)
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  reading message length
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  message length (22) in slot 2 does not match with slot 0(23)
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:0 name:"in_hot_standby" value:"off"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:1 name:"in_hot_standby" value:"on"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:2 name:"in_hot_standby" value:"on"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:0 name:"integer_datetimes" value:"on"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:1 name:"integer_datetimes" value:"on"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:2 name:"integer_datetimes" value:"on"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:0 name:"TimeZone" value:"Asia/Tokyo"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:1 name:"TimeZone" value:"Asia/Tokyo"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:2 name:"TimeZone" value:"Asia/Tokyo"
2025-11-06 18:19:46.751: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.751: child pid 87766: DETAIL:  backend:0 name:"IntervalStyle" value:"postgres"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:1 name:"IntervalStyle" value:"postgres"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:2 name:"IntervalStyle" value:"postgres"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:0 name:"search_path" value:""$user", public"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:1 name:"search_path" value:""$user", public"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:2 name:"search_path" value:""$user", public"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:0 name:"is_superuser" value:"on"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:1 name:"is_superuser" value:"on"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:2 name:"is_superuser" value:"on"
2025-11-06 18:19:46.752: child pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: child pid 87766: DETAIL:  backend:0 name:"application_name" value:"psql"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"application_name" value:"psql"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"application_name" value:"psql"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"default_transaction_read_only" value:"off"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"default_transaction_read_only" value:"off"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"default_transaction_read_only" value:"off"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"scram_iterations" value:"4096"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"scram_iterations" value:"4096"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"scram_iterations" value:"4096"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"DateStyle" value:"ISO, MDY"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"DateStyle" value:"ISO, MDY"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"DateStyle" value:"ISO, MDY"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"standard_conforming_strings" value:"on"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"standard_conforming_strings" value:"on"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"standard_conforming_strings" value:"on"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"session_authorization" value:"t-ishii"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"session_authorization" value:"t-ishii"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"session_authorization" value:"t-ishii"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"client_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"client_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"client_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"server_version" value:"18.0"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"server_version" value:"18.0"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"server_version" value:"18.0"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:0 name:"server_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:1 name:"server_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  process parameter status
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  backend:2 name:"server_encoding" value:"UTF8"
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  cancel key length: 4
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  authentication backend
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  cp->info[i]:0x726410677c08 pid:87787
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  authentication backend
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  cp->info[i]:0x726410677da8 pid:87786
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  authentication backend
2025-11-06 18:19:46.752: psql pid 87766: DETAIL:  cp->info[i]:0x726410677f48 pid:87788
2025-11-06 18:19:46.752: psql pid 87766: DEBUG:  sending backend key data
2025-11-06 18:19:46.755: psql pid 87766: DEBUG:  selecting load balance node
2025-11-06 18:19:46.755: psql pid 87766: DETAIL:  selected backend id is 1
2025-11-06 18:19:46.756: psql pid 87766: LOG:  DB node id: 0 backend pid: 87787 statement: SELECT pg_catalog.version()
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  username: "t-ishii" database_name: "test"
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  query: "SELECT pg_catalog.version()"
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  `t-ishiiSELECT pg_catalog.version()test' -> `c8645f9bdf015b6b5ee4667cb578f1b3'
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  fetching from cache storage
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  search key "c8645f9bdf015b6b5ee4667cb578f1b3"
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  fetching from cache storage
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  cache not found on shared memory
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  not hit local relation cache and query cache
2025-11-06 18:19:46.756: psql pid 87766: DETAIL:  query:SELECT pg_catalog.version()
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.756: psql pid 87766: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.version()"
2025-11-06 18:19:46.756: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.757: psql pid 87766: DEBUG:  committing relation cache to cache storage
2025-11-06 18:19:46.757: psql pid 87766: DETAIL:  Query="SELECT pg_catalog.version()"
2025-11-06 18:19:46.757: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.757: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.757: psql pid 87766: DETAIL:  username: "t-ishii" database_name: "test"
2025-11-06 18:19:46.757: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.757: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.757: psql pid 87766: DETAIL:  query: "SELECT pg_catalog.version()"
2025-11-06 18:19:46.757: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.757: psql pid 87766: DEBUG:  memcache encode key
2025-11-06 18:19:46.757: psql pid 87766: DETAIL:  `t-ishiiSELECT pg_catalog.version()test' -> `c8645f9bdf015b6b5ee4667cb578f1b3'
2025-11-06 18:19:46.757: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.757: psql pid 87766: DEBUG:  committing relation cache to cache storage
2025-11-06 18:19:46.757: psql pid 87766: DETAIL:  memqcache_expire = 0
2025-11-06 18:19:46.757: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.758: psql pid 87766: DEBUG:  memcache adding item
2025-11-06 18:19:46.758: psql pid 87766: DETAIL:  new item inserted. blockid: 0 itemid:0
2025-11-06 18:19:46.758: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.758: psql pid 87766: DEBUG:  memcache adding item
2025-11-06 18:19:46.758: psql pid 87766: DETAIL:  block: 0 item: 0
2025-11-06 18:19:46.758: psql pid 87766: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-06 18:19:46.758: psql pid 87766: DEBUG:  SimpleQuery
2025-11-06 18:19:46.758: psql pid 87766: DETAIL:  nodes reporting
2025-11-06 18:19:46.761: psql pid 87766: DEBUG:  authenticate kind = 0
2025-11-06 18:19:46.761: psql pid 87766: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:46.761: psql pid 87766: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:46.761: psql pid 87766: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:46.765: psql pid 87766: DEBUG:  authenticate kind = 0
2025-11-06 18:19:46.765: psql pid 87766: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:46.765: psql pid 87766: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:46.765: psql pid 87766: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  authenticate kind = 0
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  decide where to send the query
2025-11-06 18:19:46.768: psql pid 87766: DETAIL:  destination = 3 for query= "DISCARD ALL"
2025-11-06 18:19:46.768: psql pid 87766: LOG:  DB node id: 0 backend pid: 87787 statement: DISCARD ALL
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  waiting for query response
2025-11-06 18:19:46.768: psql pid 87766: DETAIL:  waiting for backend:0 to complete the query
2025-11-06 18:19:46.768: psql pid 87766: LOG:  DB node id: 1 backend pid: 87786 statement: DISCARD ALL
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  waiting for query response
2025-11-06 18:19:46.768: psql pid 87766: DETAIL:  waiting for backend:1 to complete the query
2025-11-06 18:19:46.768: psql pid 87766: DEBUG:  setting backend connection close timer
2025-11-06 18:19:46.768: psql pid 87766: DETAIL:  close time 1762420786
2025-11-06 18:19:55.830: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.830: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.830: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.831: health_check1 pid 87774: DEBUG:  authenticate kind = 0
2025-11-06 18:19:55.831: health_check1 pid 87774: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:55.831: health_check1 pid 87774: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:55.831: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.831: health_check1 pid 87774: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.831: health_check1 pid 87774: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:55.831: health_check1 pid 87774: DETAIL:  No such file or directory
2025-11-06 18:19:55.831: health_check0 pid 87773: DEBUG:  authenticate kind = 0
2025-11-06 18:19:55.831: health_check0 pid 87773: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:55.831: health_check0 pid 87773: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:55.831: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.831: health_check0 pid 87773: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.832: health_check0 pid 87773: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:55.832: health_check0 pid 87773: DETAIL:  No such file or directory
2025-11-06 18:19:55.832: health_check2 pid 87775: DEBUG:  authenticate kind = 0
2025-11-06 18:19:55.832: health_check2 pid 87775: DEBUG:  authenticate backend: key data received
2025-11-06 18:19:55.832: health_check2 pid 87775: DEBUG:  authenticate backend: transaction state: I
2025-11-06 18:19:55.832: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.832: health_check2 pid 87775: DEBUG:  health check: clearing alarm
2025-11-06 18:19:55.832: health_check2 pid 87775: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-06 18:19:55.832: health_check2 pid 87775: DETAIL:  No such file or directory
2025-11-06 18:20:00.835: sr_check_worker pid 87772: LOG:  Replication of node: 1 is behind 0.025 second(s) from the primary server (node: 0) [external command]
2025-11-06 18:20:00.835: sr_check_worker pid 87772: CONTEXT:  while checking replication time lag
2025-11-06 18:20:00.835: sr_check_worker pid 87772: LOG:  Replication of node: 2 is behind 0.050 second(s) from the primary server (node: 0) [external command]
2025-11-06 18:20:00.835: sr_check_worker pid 87772: CONTEXT:  while checking replication time lag
2025-11-06 18:20:00.835: sr_check_worker pid 87772: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:20:00.835: sr_check_worker pid 87772: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  verify_backend_node_status: multiple standbys: 2
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  verify_backend_node_status: detach_false_primary is off and no additional checking is performed
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  node status[0]: 1
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  node status[1]: 2
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  node status[2]: 2
2025-11-06 18:20:00.836: sr_check_worker pid 87772: DEBUG:  pool_release_follow_primary_lock called
2025-11-06 18:20:01.810: main pid 87757: LOG:  exit handler called (signal: 2)
2025-11-06 18:20:01.810: main pid 87757: LOG:  shutting down by signal 2
2025-11-06 18:20:01.810: main pid 87757: LOG:  terminating all child processes
2025-11-06 18:20:01.813: main pid 87757: LOG:  Pgpool-II system is shutdown

^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-11-18 11:37  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-11-18 11:37 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

Please see attached an updated version.

thank you

On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <[email protected]> wrote:

> > Sorry for that - thanks for the patch.
> >
> > Please find attached a new version
>
> Thanks for the new version. Unfortunately this time regression test
> fails at:
>
> > Waiting for command timeout...
> > fail: command timeout not detected
>
> Attached is the pgpool.log.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <[email protected]>
> wrote:
> >
> >> > thanks and sorry for the issues, please find attached updated version.
> >>
> >> No problem.
> >>
> >> This time the patch applies fine, no compiler warnings.  However,
> >> regression test did not passed here (on Ubuntu 24 LTS if this
> >> matters).  So I looked into
> >> src/test/regression/tests/041.external_replication_delay/test.sh a
> >> little bit and apply attached patch (test.sh.patch). It moved forward
> >> partially but failed at:
> >>
> >> fail: command execution failure not detected
> >>
> >> Please find attached
> >>
> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
> >> and src/test/regression/log/041.external_replication_delay.
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>


-- 
Nadav Shatz
Tailor Brands | CTO


Attachments:

  [application/octet-stream] 0001-Fix-multiple-issues-in-external-replication-delay-fe.patch (8.1K, 3-0001-Fix-multiple-issues-in-external-replication-delay-fe.patch)
  download | inline diff:
From b43c38293e62163ac6371ca671874d70aa432e3d Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 18 Nov 2025 13:19:02 +0200
Subject: [PATCH] Fix multiple issues in external replication delay feature

This commit addresses several critical bugs and code quality issues
in the external replication delay command implementation:

Memory Safety:
- Initialize command pointer to NULL to prevent undefined behavior
  when freeing uninitialized memory

Security:
- Add hostname validation to detect shell metacharacters
- Warn about potentially dangerous characters in hostnames
- Changed warning level from WARNING to LOG to avoid excessive alerts

Signal Handler Race Conditions:
- Reorder signal(SIGALRM, SIG_DFL) before alarm(0) in all error paths
- Prevents race condition where alarm could trigger after handler removal
- Fixes: popen failure, fgets failure, and exception handler paths

Primary Node Race Condition:
- Capture REAL_PRIMARY_NODE_ID into local variable at function start
- Use captured value consistently throughout function execution
- Prevents inconsistencies if primary changes during execution

Output Parsing Robustness:
- Enhanced validation for external command output
- Provide specific warnings for: no output, too few values, too many values
- Improved error messages with helpful hints for operators

Code Quality:
- Remove unused replica_idx variable to eliminate compiler warnings
- Improve code readability and maintainability

All changes have been tested in a Docker-based integration environment
with PostgreSQL primary/replica setup and confirmed working correctly.
---
 src/streaming_replication/pool_worker_child.c | 68 +++++++++++++------
 1 file changed, 48 insertions(+), 20 deletions(-)

diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 81dc82922..1ae290e28 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -694,11 +694,10 @@ static void
 check_replication_time_lag_with_cmd(void)
 {
 	FILE		   *fp;
-	char		   *command;
+	char		   *command = NULL;
 	char		   *line;
 	char		   *token;
 	char		   *saveptr;
-	int				replica_idx;
 	int				num_replicas;
 	double			delay_ms;
 	uint64			delay;
@@ -724,6 +723,9 @@ check_replication_time_lag_with_cmd(void)
 		return;
 	}
 
+	/* Capture primary node ID to avoid race conditions during execution */
+	int primary_node_id = REAL_PRIMARY_NODE_ID;
+
 	if (!pool_config->replication_delay_source_cmd ||
 		strlen(pool_config->replication_delay_source_cmd) == 0)
 	{
@@ -757,7 +759,7 @@ check_replication_time_lag_with_cmd(void)
 		/* Calculate total command length including space-separated replica identifiers */
 		for (int i = 0; i < NUM_BACKENDS; i++)
 		{
-			if (i == REAL_PRIMARY_NODE_ID)
+			if (i == primary_node_id)
 				continue; /* Skip primary node */
 
 			char *ident = build_instance_identifier_for_node(i);
@@ -771,7 +773,7 @@ check_replication_time_lag_with_cmd(void)
 		/* Append replica identifiers */
 		for (int i = 0; i < NUM_BACKENDS; i++)
 		{
-			if (i == REAL_PRIMARY_NODE_ID)
+			if (i == primary_node_id)
 				continue; /* Skip primary node */
 
 			char *ident = build_instance_identifier_for_node(i);
@@ -791,8 +793,9 @@ check_replication_time_lag_with_cmd(void)
 		fp = popen(command, "r");
 		if (fp == NULL)
 		{
-			alarm(0); /* Cancel alarm */
+			/* Cancel timeout: restore signal handler first to avoid race condition */
 			signal(SIGALRM, SIG_DFL);
+			alarm(0);
 			ereport(ERROR,
 					(errmsg("failed to execute replication delay command: %s", command),
 					 errdetail("popen failed: %m")));
@@ -802,8 +805,9 @@ check_replication_time_lag_with_cmd(void)
 		{
 			int pclose_result = pclose(fp);
 			fp = NULL;
-			alarm(0); /* Cancel alarm */
+			/* Cancel timeout: restore signal handler first to avoid race condition */
 			signal(SIGALRM, SIG_DFL);
+			alarm(0);
 
 			if (command_timeout_occurred)
 			{
@@ -820,8 +824,9 @@ check_replication_time_lag_with_cmd(void)
 			}
 		}
 
-		alarm(0); /* Cancel alarm */
+		/* Cancel timeout: restore signal handler first to avoid race condition */
 		signal(SIGALRM, SIG_DFL);
+		alarm(0);
 
 		/* Check if output was truncated */
 		if (strlen(line) == MAX_CMD_OUTPUT - 1 && line[MAX_CMD_OUTPUT - 2] != '\n')
@@ -836,7 +841,7 @@ check_replication_time_lag_with_cmd(void)
 		command = NULL;
 
 		/* Set primary node delay to 0 */
-		bkinfo = pool_get_node_info(REAL_PRIMARY_NODE_ID);
+		bkinfo = pool_get_node_info(primary_node_id);
 		bkinfo->standby_delay = 0;
 		bkinfo->standby_delay_by_time = true;
 
@@ -853,28 +858,40 @@ check_replication_time_lag_with_cmd(void)
 		}
 		pfree(line_copy);
 
-		if (token_count != num_replicas)
+		/* Validate output format */
+		if (token_count == 0)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command produced no output"),
+					 errhint("Command should output delay values separated by spaces, one per replica node")));
+		}
+		else if (token_count < num_replicas)
 		{
 			ereport(WARNING,
 					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
 							token_count, num_replicas),
-					 errhint("Command should output one delay value per replica node")));
+					 errhint("Command should output one delay value per replica node. Missing values will be treated as 0.")));
+		}
+		else if (token_count > num_replicas)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
+							token_count, num_replicas),
+					 errhint("Command should output exactly one delay value per replica node. Extra values will be ignored.")));
 		}
 
-		/* Parse the output - one delay value per replica in order */
-		token = strtok_r(line, " \t\n", &saveptr);
-		replica_idx = 0;
+	/* Parse the output - one delay value per replica in order */
+	token = strtok_r(line, " \t\n", &saveptr);
 
 		for (int i = 0; i < NUM_BACKENDS && token != NULL; i++)
 		{
-			if (i == REAL_PRIMARY_NODE_ID)
+			if (i == primary_node_id)
 				continue; /* Skip primary - it's not in the output */
 
 			if (!VALID_BACKEND(i))
 			{
 				/* Skip invalid backend but consume token */
 				token = strtok_r(NULL, " \t\n", &saveptr);
-				replica_idx++;
 				continue;
 			}
 
@@ -900,7 +917,6 @@ check_replication_time_lag_with_cmd(void)
 								i)));
 				/* Keep previous delay value, don't trigger failover */
 				token = strtok_r(NULL, " \t\n", &saveptr);
-				replica_idx++;
 				continue;
 			}
 
@@ -933,19 +949,19 @@ check_replication_time_lag_with_cmd(void)
 			{
 				ereport(LOG,
 						(errmsg("Replication of node: %d is behind %.3f second(s) from the primary server (node: %d) [external command]",
-								i, delay_ms / 1000, REAL_PRIMARY_NODE_ID)));
+								i, delay_ms / 1000, primary_node_id)));
 			}
 
 			token = strtok_r(NULL, " \t\n", &saveptr);
-			replica_idx++;
 		}
 
 	}
 	PG_CATCH();
 	{
 		/* Cleanup in case of error */
-		alarm(0); /* Cancel any pending alarm */
+		/* Cancel timeout: restore signal handler first to avoid race condition */
 		signal(SIGALRM, SIG_DFL);
+		alarm(0);
 		if (fp)
 		{
 			pclose(fp);
@@ -976,6 +992,7 @@ static char *
 build_instance_identifier_for_node(int node_id)
 {
 	BackendInfo *bi = pool_get_node_info(node_id);
+	const char *hostname;
 
 	if (!bi || bi->backend_hostname[0] == '\0' || bi->backend_port <= 0)
 	{
@@ -983,8 +1000,19 @@ build_instance_identifier_for_node(int node_id)
 		return psprintf("unknown_node_%d", node_id);
 	}
 
+	hostname = bi->backend_hostname;
+
+	/* Validate hostname for security - check for shell metacharacters */
+	if (strpbrk(hostname, "$`\\|;&<>()[]{}\"\'\n\r\t") != NULL)
+	{
+		ereport(LOG,
+				(errmsg("hostname for node %d contains potentially dangerous characters: %s",
+						node_id, hostname),
+				 errhint("Hostnames with shell metacharacters may pose security risks when used with external commands. Consider using IP addresses or sanitized hostnames.")));
+	}
+
 	/* Use hostname:port format */
-	return psprintf("%s:%d", bi->backend_hostname, bi->backend_port);
+	return psprintf("%s:%d", hostname, bi->backend_port);
 }
 
 static void
-- 
2.52.0



^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-11-19 23:09  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-11-19 23:09 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

Thank you for new patch.
Unfortunately the patch did not apply to current master.

$ git apply ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch 
error: patch failed: src/streaming_replication/pool_worker_child.c:694
error: src/streaming_replication/pool_worker_child.c: patch does not apply

Maybe the patch is on top of your previous patch?

Also I suggest to use "-v" option of "git format-patch" to add the
patch version number so that we can easily know which patch is the
latest.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Hi Tatsuo,
> 
> Please see attached an updated version.
> 
> thank you
> 
> On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> > Sorry for that - thanks for the patch.
>> >
>> > Please find attached a new version
>>
>> Thanks for the new version. Unfortunately this time regression test
>> fails at:
>>
>> > Waiting for command timeout...
>> > fail: command timeout not detected
>>
>> Attached is the pgpool.log.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> > thanks and sorry for the issues, please find attached updated version.
>> >>
>> >> No problem.
>> >>
>> >> This time the patch applies fine, no compiler warnings.  However,
>> >> regression test did not passed here (on Ubuntu 24 LTS if this
>> >> matters).  So I looked into
>> >> src/test/regression/tests/041.external_replication_delay/test.sh a
>> >> little bit and apply attached patch (test.sh.patch). It moved forward
>> >> partially but failed at:
>> >>
>> >> fail: command execution failure not detected
>> >>
>> >> Please find attached
>> >>
>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>> >> and src/test/regression/log/041.external_replication_delay.
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-11-23 09:53  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-11-23 09:53 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

Sorry again, this was due to the separation of 2 patches and i only sent
the one.

I've merged it into 1 commit and 1 patch and rebased over master to avoid
these issues moving forward.

PFA latest version

On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> Thank you for new patch.
> Unfortunately the patch did not apply to current master.
>
> $ git apply
> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
> error: patch failed: src/streaming_replication/pool_worker_child.c:694
> error: src/streaming_replication/pool_worker_child.c: patch does not apply
>
> Maybe the patch is on top of your previous patch?
>
> Also I suggest to use "-v" option of "git format-patch" to add the
> patch version number so that we can easily know which patch is the
> latest.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Hi Tatsuo,
> >
> > Please see attached an updated version.
> >
> > thank you
> >
> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <[email protected]>
> wrote:
> >
> >> > Sorry for that - thanks for the patch.
> >> >
> >> > Please find attached a new version
> >>
> >> Thanks for the new version. Unfortunately this time regression test
> >> fails at:
> >>
> >> > Waiting for command timeout...
> >> > fail: command timeout not detected
> >>
> >> Attached is the pgpool.log.
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <[email protected]>
> >> wrote:
> >> >
> >> >> > thanks and sorry for the issues, please find attached updated
> version.
> >> >>
> >> >> No problem.
> >> >>
> >> >> This time the patch applies fine, no compiler warnings.  However,
> >> >> regression test did not passed here (on Ubuntu 24 LTS if this
> >> >> matters).  So I looked into
> >> >> src/test/regression/tests/041.external_replication_delay/test.sh a
> >> >> little bit and apply attached patch (test.sh.patch). It moved forward
> >> >> partially but failed at:
> >> >>
> >> >> fail: command execution failure not detected
> >> >>
> >> >> Please find attached
> >> >>
> >>
> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
> >> >> and src/test/regression/log/041.external_replication_delay.
> >> >>
> >> >> Best regards,
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS K.K.
> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> Japanese:http://www.sraoss.co.jp
> >> >>
> >> >
> >> >
> >> > --
> >> > Nadav Shatz
> >> > Tailor Brands | CTO
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>


-- 
Nadav Shatz
Tailor Brands | CTO


Attachments:

  [application/octet-stream] v1-0001-feat-external-replication-delay-injection-via-ext.patch (48.7K, 3-v1-0001-feat-external-replication-delay-injection-via-ext.patch)
  download | inline diff:
From caae8e05284ba94090048758ec4e6b49a6b4d152 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Sun, 23 Nov 2025 10:56:30 +0200
Subject: [PATCH v1] feat: external replication delay injection via external
 command

---
 doc/src/sgml/stream-check.sgml                |  68 +++
 src/config/pool_config_variables.c            |  21 +
 src/include/pool_config.h                     |   3 +-
 src/sample/pgpool.conf.sample-stream          |  14 +
 src/streaming_replication/pool_worker_child.c | 364 +++++++++++++++-
 .../041.external_replication_delay/README     |  59 +++
 .../041.external_replication_delay/test.sh    | 404 ++++++++++++++++++
 .../test_parsing.sh                           |  54 +++
 .../test_validation.sh                        | 323 ++++++++++++++
 9 files changed, 1305 insertions(+), 5 deletions(-)
 create mode 100644 src/test/regression/tests/041.external_replication_delay/README
 create mode 100755 src/test/regression/tests/041.external_replication_delay/test.sh
 create mode 100755 src/test/regression/tests/041.external_replication_delay/test_parsing.sh
 create mode 100755 src/test/regression/tests/041.external_replication_delay/test_validation.sh

diff --git a/doc/src/sgml/stream-check.sgml b/doc/src/sgml/stream-check.sgml
index d2ca3ca49..fc4799080 100644
--- a/doc/src/sgml/stream-check.sgml
+++ b/doc/src/sgml/stream-check.sgml
@@ -309,6 +309,74 @@ GRANT pg_monitor TO sr_check_user;
     </listitem>
   </varlistentry>
 
+  <varlistentry id="guc-replication-delay-source-cmd" xreflabel="replication_delay_source_cmd">
+   <term><varname>replication_delay_source_cmd</varname> (<type>string</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_cmd</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies an external command to retrieve replication delay information for replica nodes.
+     When this parameter is set and not empty, <productname>Pgpool-II</productname> uses the
+     external command instead of built-in database queries to obtain replication delays.
+     The command is executed as the <productname>Pgpool-II</productname> process user.
+    </para>
+    <para>
+     The command receives replica node identifiers as positional arguments, with the primary
+     node omitted. Each identifier is in the format <literal>&lt;hostname&gt;:&lt;port&gt;</literal>,
+     for example <literal>server1:5432 server2:5432</literal>. The order matches
+     <productname>Pgpool-II</productname>'s backend order (excluding the primary), allowing the
+     script to correlate external metrics (such as from AWS CloudWatch for Aurora) to the correct nodes.
+    </para>
+    <para>
+     The command must write a single line to stdout containing one whitespace-separated delay value
+     per replica, in milliseconds, in the same order as the arguments. The primary node's delay is
+     implicitly zero and should not be included in the output. Delay values can be integers or
+     floating-point numbers.
+    </para>
+    <para>
+     Special value: <literal>-1</literal> indicates a replica that is down but not yet detected
+     by <productname>Pgpool-II</productname>'s health checks. <productname>Pgpool-II</productname>
+     will log this condition but rely on its own health-check logic to decide whether to trigger
+     failover; no failover is triggered solely by receiving <literal>-1</literal>.
+    </para>
+    <para>
+     Example for a 3-node cluster (1 primary + 2 replicas): if the command receives arguments
+     <literal>server1:5432 server2:5432</literal>, it should output <literal>"25.5 100"</literal>
+     to indicate the first replica has 25.5ms delay and the second has 100ms delay.
+    </para>
+    <para>
+     Default is empty (use built-in replication delay queries).
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry id="guc-replication-delay-source-timeout" xreflabel="replication_delay_source_timeout">
+   <term><varname>replication_delay_source_timeout</varname> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_timeout</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies the timeout in seconds for the external command specified by
+     <xref linkend="guc-replication-delay-source-cmd">.
+     If the command does not finish within the timeout, <productname>Pgpool-II</productname>
+     logs an error and continues using the built-in method.
+    </para>
+    <para>
+     Default is 10 seconds. Valid range is 1-3600 seconds.
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
   <varlistentry id="guc-log-standby-delay" xreflabel="log_standby_delay">
    <term><varname>log_standby_delay</varname> (<type>enum</type>)
     <indexterm>
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 62a05979a..a35d2200f 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -980,6 +980,16 @@ static struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_cmd", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"External command to retrieve replication delay information.",
+			CONFIG_VAR_TYPE_STRING, false, 0
+		},
+		&g_pool_config.replication_delay_source_cmd,
+		"",
+		NULL, NULL, NULL, NULL
+	},
+
 	{
 		{"failback_command", CFGCXT_RELOAD, FAILOVER_CONFIG,
 			"Command to execute when backend node is attached.",
@@ -2334,6 +2344,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_timeout", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Timeout for external replication delay command execution in seconds.",
+			CONFIG_VAR_TYPE_INT, false, 0
+		},
+		&g_pool_config.replication_delay_source_timeout,
+		10,
+		1, 3600,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	EMPTY_CONFIG_INT
 };
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 9160a31c8..5bc646805 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -86,7 +86,6 @@ typedef enum LogStandbyDelayModes
 	LSD_NONE
 } LogStandbyDelayModes;
 
-
 typedef enum MemCacheMethod
 {
 	SHMEM_CACHE = 1,
@@ -363,6 +362,8 @@ typedef struct
 	char	   *sr_check_password;	/* password for sr_check_user */
 	char	   *sr_check_database;	/* PostgreSQL database name for streaming
 									 * replication check */
+	char	   *replication_delay_source_cmd;	/* external command for replication delay */
+	int			replication_delay_source_timeout;	/* timeout for external command in seconds */
 	char	   *failover_command;	/* execute command when failover happens */
 	char	   *follow_primary_command; /* execute command when failover is
 										 * ended */
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 797906491..454fdb9e5 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -519,6 +519,20 @@ backend_clustering_mode = streaming_replication
 
 #sr_check_database = 'postgres'
                                    # Database name for streaming replication check
+
+#replication_delay_source_cmd = ''
+                                   # External command to retrieve replication delay information
+                                   # If set, pgpool uses this command instead of built-in queries
+                                   # Command receives replica node identifiers (host:port) as arguments
+                                   # Primary node is omitted from arguments
+                                   # Command should output one delay value (in ms) per replica
+                                   # Use -1 to indicate a replica that is down but not yet detected
+                                   # Format: "25 100" for 2 replicas (e.g., 3-node cluster with 1 primary)
+                                   # Command runs as the pgpool process user
+#replication_delay_source_timeout = 10
+                                   # Timeout for external command execution in seconds
+                                   # Range: 1-3600 seconds (default: 10)
+
 #delay_threshold = 0
                                    # Threshold before not dispatching query to standby node
                                    # Unit is in bytes
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 5bf19c37d..1ae290e28 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -76,6 +76,8 @@ static volatile sig_atomic_t restart_request = 0;
 static void establish_persistent_connection(void);
 static void discard_persistent_connection(void);
 static void check_replication_time_lag(void);
+static void check_replication_time_lag_with_cmd(void);
+static char *build_instance_identifier_for_node(int node_id);
 static void CheckReplicationTimeLagErrorCb(void *arg);
 static unsigned long long int text_to_lsn(char *text);
 static RETSIGTYPE my_signal_handler(int sig);
@@ -259,11 +261,16 @@ do_worker_child(void *params)
 					POOL_NODE_STATUS *node_status;
 					int			i;
 
-					/* Do replication time lag checking */
-					check_replication_time_lag();
+			/* Do replication time lag checking */
+			/* Use external command if replication_delay_source_cmd is configured */
+			if (pool_config->replication_delay_source_cmd &&
+				strlen(pool_config->replication_delay_source_cmd) > 0)
+				check_replication_time_lag_with_cmd();
+			else
+				check_replication_time_lag();
 
-					/* Check node status */
-					node_status = verify_backend_node_status(slots);
+			/* Check node status */
+			node_status = verify_backend_node_status(slots);
 
 
 					for (i = 0; i < NUM_BACKENDS; i++)
@@ -659,6 +666,355 @@ check_replication_time_lag(void)
 	error_context_stack = callback.previous;
 }
 
+#define MAX_CMD_OUTPUT 4096
+#define MAX_REASONABLE_DELAY_MS 3600000.0  /* 1 hour in milliseconds */
+
+/* Global variable to track command timeout */
+static volatile sig_atomic_t command_timeout_occurred = 0;
+
+/*
+ * Signal handler for command timeou
+ */
+static void
+command_timeout_handler(int sig)
+{
+	command_timeout_occurred = 1;
+}
+
+
+
+/*
+ * Check replication time lag using external command
+ *
+ * The external command receives only replica (standby) node identifiers as arguments,
+ * omitting the primary node. It returns delay values in milliseconds for each replica.
+ * A value of -1 indicates a node that is down but not yet detected by pgpool's health checks.
+ */
+static void
+check_replication_time_lag_with_cmd(void)
+{
+	FILE		   *fp;
+	char		   *command = NULL;
+	char		   *line;
+	char		   *token;
+	char		   *saveptr;
+	int				num_replicas;
+	double			delay_ms;
+	uint64			delay;
+	int				token_count = 0;
+	BackendInfo	   *bkinfo;
+	ErrorContextCallback callback;
+
+	if (NUM_BACKENDS <= 1)
+	{
+		/* If there's only one node, there's no point to do checking */
+		return;
+	}
+
+	if (REAL_PRIMARY_NODE_ID < 0)
+	{
+		/* No need to check if there's no primary */
+		return;
+	}
+
+	if (!VALID_BACKEND(REAL_PRIMARY_NODE_ID))
+	{
+		/* No need to check replication delay if primary is down */
+		return;
+	}
+
+	/* Capture primary node ID to avoid race conditions during execution */
+	int primary_node_id = REAL_PRIMARY_NODE_ID;
+
+	if (!pool_config->replication_delay_source_cmd ||
+		strlen(pool_config->replication_delay_source_cmd) == 0)
+	{
+		ereport(WARNING,
+				(errmsg("replication_delay_source_cmd is not configured"),
+				 errhint("Set replication_delay_source_cmd to use external command mode")));
+		/* Fall back to builtin method */
+		check_replication_time_lag();
+		return;
+	}
+
+	/* Allocate buffer for command output */
+	line = palloc(MAX_CMD_OUTPUT);
+	fp = NULL;
+
+	/*
+	 * Register a error context callback to throw proper context message
+	 */
+	callback.callback = CheckReplicationTimeLagErrorCb;
+	callback.arg = NULL;
+	callback.previous = error_context_stack;
+	error_context_stack = &callback;
+
+	/* Execute command as current process user */
+	PG_TRY();
+	{
+		const char *base_command = pool_config->replication_delay_source_cmd;
+		size_t total_len = strlen(base_command) + 1; /* +1 for NUL */
+
+		/* Build command with replica-only arguments (omit primary) */
+		/* Calculate total command length including space-separated replica identifiers */
+		for (int i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == primary_node_id)
+				continue; /* Skip primary node */
+
+			char *ident = build_instance_identifier_for_node(i);
+			total_len += 1 /* space */ + strlen(ident);
+			pfree(ident);
+		}
+
+		command = palloc(total_len);
+		strlcpy(command, base_command, total_len);
+
+		/* Append replica identifiers */
+		for (int i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == primary_node_id)
+				continue; /* Skip primary node */
+
+			char *ident = build_instance_identifier_for_node(i);
+			strlcat(command, " ", total_len);
+			strlcat(command, ident, total_len);
+			pfree(ident);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("executing replication delay command: %s", command)));
+
+		/* Set up timeout for command execution */
+		command_timeout_occurred = 0;
+		signal(SIGALRM, command_timeout_handler);
+		alarm(pool_config->replication_delay_source_timeout);
+
+		fp = popen(command, "r");
+		if (fp == NULL)
+		{
+			/* Cancel timeout: restore signal handler first to avoid race condition */
+			signal(SIGALRM, SIG_DFL);
+			alarm(0);
+			ereport(ERROR,
+					(errmsg("failed to execute replication delay command: %s", command),
+					 errdetail("popen failed: %m")));
+		}
+
+		if (fgets(line, MAX_CMD_OUTPUT, fp) == NULL)
+		{
+			int pclose_result = pclose(fp);
+			fp = NULL;
+			/* Cancel timeout: restore signal handler first to avoid race condition */
+			signal(SIGALRM, SIG_DFL);
+			alarm(0);
+
+			if (command_timeout_occurred)
+			{
+				ereport(ERROR,
+						(errmsg("replication delay command timed out after %d seconds: %s",
+								pool_config->replication_delay_source_timeout, command),
+						 errhint("Consider increasing replication_delay_source_timeout or optimizing the command")));
+			}
+			else
+			{
+				ereport(ERROR,
+						(errmsg("failed to read output from replication delay command: %s", command),
+						 errdetail("command exit status: %d", pclose_result)));
+			}
+		}
+
+		/* Cancel timeout: restore signal handler first to avoid race condition */
+		signal(SIGALRM, SIG_DFL);
+		alarm(0);
+
+		/* Check if output was truncated */
+		if (strlen(line) == MAX_CMD_OUTPUT - 1 && line[MAX_CMD_OUTPUT - 2] != '\n')
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command output may have been truncated")));
+		}
+
+		pclose(fp);
+		fp = NULL;
+		pfree(command);
+		command = NULL;
+
+		/* Set primary node delay to 0 */
+		bkinfo = pool_get_node_info(primary_node_id);
+		bkinfo->standby_delay = 0;
+		bkinfo->standby_delay_by_time = true;
+
+		/* Count expected replicas */
+		num_replicas = NUM_BACKENDS - 1; /* Total nodes minus primary */
+
+		/* Count tokens in output for validation */
+		char *line_copy = pstrdup(line);
+		char *temp_token = strtok(line_copy, " \t\n");
+		while (temp_token != NULL)
+		{
+			token_count++;
+			temp_token = strtok(NULL, " \t\n");
+		}
+		pfree(line_copy);
+
+		/* Validate output format */
+		if (token_count == 0)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command produced no output"),
+					 errhint("Command should output delay values separated by spaces, one per replica node")));
+		}
+		else if (token_count < num_replicas)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
+							token_count, num_replicas),
+					 errhint("Command should output one delay value per replica node. Missing values will be treated as 0.")));
+		}
+		else if (token_count > num_replicas)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
+							token_count, num_replicas),
+					 errhint("Command should output exactly one delay value per replica node. Extra values will be ignored.")));
+		}
+
+	/* Parse the output - one delay value per replica in order */
+	token = strtok_r(line, " \t\n", &saveptr);
+
+		for (int i = 0; i < NUM_BACKENDS && token != NULL; i++)
+		{
+			if (i == primary_node_id)
+				continue; /* Skip primary - it's not in the output */
+
+			if (!VALID_BACKEND(i))
+			{
+				/* Skip invalid backend but consume token */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				continue;
+			}
+
+			char *endptr;
+			delay_ms = strtod(token, &endptr);
+
+			/* Validate the conversion */
+			if (*endptr != '\0')
+			{
+				ereport(WARNING,
+						(errmsg("invalid delay value '%s' for node %d, treating as 0",
+								token, i)));
+				delay_ms = 0;
+			}
+
+			bkinfo = pool_get_node_info(i);
+
+			/* Handle -1 for down nodes */
+			if (delay_ms == -1.0)
+			{
+				ereport(LOG,
+						(errmsg("node %d reported as down by external command (delay -1), relying on health check for failover decision",
+								i)));
+				/* Keep previous delay value, don't trigger failover */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				continue;
+			}
+
+			/* Validate delay value range */
+			if (delay_ms < 0)
+			{
+				ereport(WARNING,
+						(errmsg("negative delay value %.3f for node %d (other than -1), treating as 0",
+								delay_ms, i)));
+				delay_ms = 0;
+			}
+			else if (delay_ms > MAX_REASONABLE_DELAY_MS)
+			{
+				ereport(WARNING,
+						(errmsg("extremely large delay value %.3f for node %d",
+								delay_ms, i)));
+			}
+
+			/* Convert delay from milliseconds to microseconds for internal storage */
+			delay = (uint64)(delay_ms * 1000);
+			bkinfo->standby_delay = delay;
+			bkinfo->standby_delay_by_time = true;
+
+			/* Log delay if necessary */
+			uint64 delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in milliseconds, convert to microseconds */
+
+			if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
+				(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
+				 bkinfo->standby_delay > delay_threshold_by_time))
+			{
+				ereport(LOG,
+						(errmsg("Replication of node: %d is behind %.3f second(s) from the primary server (node: %d) [external command]",
+								i, delay_ms / 1000, primary_node_id)));
+			}
+
+			token = strtok_r(NULL, " \t\n", &saveptr);
+		}
+
+	}
+	PG_CATCH();
+	{
+		/* Cleanup in case of error */
+		/* Cancel timeout: restore signal handler first to avoid race condition */
+		signal(SIGALRM, SIG_DFL);
+		alarm(0);
+		if (fp)
+		{
+			pclose(fp);
+			fp = NULL;
+		}
+		if (line)
+			pfree(line);
+		if (command)
+			pfree(command);
+		error_context_stack = callback.previous;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Normal cleanup */
+	if (line)
+		pfree(line);
+
+	error_context_stack = callback.previous;
+}
+
+/*
+ * build_instance_identifier_for_node
+ *  Build an identifier string for a backend node for passing to external commands.
+ *  Format: "<hostname>:<port>"
+ */
+static char *
+build_instance_identifier_for_node(int node_id)
+{
+	BackendInfo *bi = pool_get_node_info(node_id);
+	const char *hostname;
+
+	if (!bi || bi->backend_hostname[0] == '\0' || bi->backend_port <= 0)
+	{
+		/* Fallback if hostname or port is not set */
+		return psprintf("unknown_node_%d", node_id);
+	}
+
+	hostname = bi->backend_hostname;
+
+	/* Validate hostname for security - check for shell metacharacters */
+	if (strpbrk(hostname, "$`\\|;&<>()[]{}\"\'\n\r\t") != NULL)
+	{
+		ereport(LOG,
+				(errmsg("hostname for node %d contains potentially dangerous characters: %s",
+						node_id, hostname),
+				 errhint("Hostnames with shell metacharacters may pose security risks when used with external commands. Consider using IP addresses or sanitized hostnames.")));
+	}
+
+	/* Use hostname:port format */
+	return psprintf("%s:%d", hostname, bi->backend_port);
+}
+
 static void
 CheckReplicationTimeLagErrorCb(void *arg)
 {
diff --git a/src/test/regression/tests/041.external_replication_delay/README b/src/test/regression/tests/041.external_replication_delay/README
new file mode 100644
index 000000000..b4df5da40
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/README
@@ -0,0 +1,59 @@
+External Replication Delay Command Test
+========================================
+
+This test verifies the external command replication delay source feature.
+
+Test Coverage:
+- External command receives replica node identifiers only (primary omitted)
+- Instance identifiers in host:port format
+- Basic external command execution with integer and float millisecond values
+- Delay threshold functionality with external commands
+- Command execution as pgpool process user (no su wrapper)
+- Error handling for missing/invalid commands
+- Command execution failure scenarios
+- Command timeout handling with configurable timeout values
+- Input validation for invalid, negative (other than -1), and extremely large delay values
+- Handling of -1 for down nodes (logged but no immediate failover)
+- Wrong number of output values validation
+- Multiple -1 values (multiple down replicas)
+- Mixed scenarios (some replicas up, some down)
+- Output truncation detection
+
+Files:
+- test.sh: Main test script
+- test_parsing.sh: Unit test for parsing logic
+- test_validation.sh: Validation and edge case testing
+- README: This documentation
+
+Key Changes from Original Version:
+- Primary node is omitted from command arguments
+- Command receives only replica identifiers
+- Instance identifiers are in host:port format (not application_name)
+- Output format: one delay per replica (not per all nodes)
+- -1 value indicates down replica without triggering failover
+- Format example: "25 100" for 2 replicas (3-node cluster = 1 primary + 2 replicas)
+
+The test creates temporary command scripts that output delay values in the format:
+"replica1_delay replica2_delay ..."
+
+Where delays are in milliseconds and can be integer or floating-point values.
+Special value -1 indicates a replica that is down but not yet detected by pgpool.
+
+Test Environment:
+- Uses streaming replication mode with 3 nodes
+- Node 0 is primary (omitted from command arguments)
+- Nodes 1 and 2 are replicas (included in command arguments)
+- Configures sr_check_period = 1 second for faster testing
+- Tests various delay scenarios and threshold behaviors
+
+Expected Behavior:
+- External commands receive replica identifiers in host:port format
+- Primary node identifier is never passed to command
+- Command outputs one delay value per replica
+- -1 values are logged but don't trigger immediate failover
+- Delay values are parsed correctly (both int and float)
+- Threshold comparisons work properly
+- Error conditions are handled gracefully
+- Commands timeout appropriately based on configuration
+- Timeout errors provide helpful messages and hints
+- Tests are reliable with proper wait mechanisms instead of fixed sleeps
diff --git a/src/test/regression/tests/041.external_replication_delay/test.sh b/src/test/regression/tests/041.external_replication_delay/test.sh
new file mode 100755
index 000000000..e1dfbcecf
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test.sh
@@ -0,0 +1,404 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command replication delay source
+#
+source $TESTLIBS
+TESTDIR=testdir
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create external command scripts for testing
+# NOTE: Commands now only output delay values for REPLICAS (not primary)
+cat > delay_cmd_static.sh << 'EOF'
+#!/bin/bash
+# Static delay values for replicas: node1=25ms, node2=50ms (node0 is primary, not included)
+echo "25 50"
+EOF
+chmod +x delay_cmd_static.sh
+
+cat > delay_cmd_float.sh << 'EOF'
+#!/bin/bash
+# Float delay values for replicas: node1=25.5ms, node2=100.75ms
+echo "25.5 100.75"
+EOF
+chmod +x delay_cmd_float.sh
+
+cat > delay_cmd_high.sh << 'EOF'
+#!/bin/bash
+# High delay values to test threshold: node1=2000ms, node2=3000ms
+echo "2000 3000"
+EOF
+chmod +x delay_cmd_high.sh
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test0: External command receives replica identifiers only (primary omitted) ==="
+# ----------------------------------------------------------------------------------------
+# Command that captures its arguments and outputs valid delays for 2 replicas
+cat > delay_cmd_args.sh << 'EOF'
+#!/bin/bash
+printf "%s " "$@" > args.txt
+echo "25 50"
+EOF
+chmod +x delay_cmd_args.sh
+
+echo "replication_delay_source_cmd = './delay_cmd_args.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+# Reduce memory requirements for macOS shared memory limits
+echo "num_init_children = 4" >> etc/pgpool.conf
+echo "max_pool = 2" >> etc/pgpool.conf
+# Disable query caching to avoid shared memory issues on macOS
+echo "memory_cache_enabled = off" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+echo "Waiting for sr_check to pass args..."
+for i in {1..10}; do
+    if [ -f args.txt ]; then
+        break
+    fi
+    sleep 1
+done
+
+if [ ! -f args.txt ]; then
+    echo fail: did not capture command arguments
+    ./shutdownall
+    exit 1
+fi
+
+ARGS_CONTENT=$(cat args.txt | sed 's/[[:space:]]*$//')
+# Should receive 2 replica identifiers in host:port format (localhost:11003 localhost:11004 or server1:11003 server2:11004)
+# Primary (localhost:11002 or server0:11002) should be omitted
+if ! echo "$ARGS_CONTENT" | grep -qE "(server1|localhost):11003"; then
+    echo "fail: expected replica1:11003 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if ! echo "$ARGS_CONTENT" | grep -qE "(server2|localhost):11004"; then
+    echo "fail: expected replica2:11004 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if echo "$ARGS_CONTENT" | grep -qE "(server0|localhost):11002"; then
+    echo "fail: primary should not be in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: argument order correct - replicas only, primary omitted, host:port format
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test1: Basic external command with integer millisecond values ==="
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_static.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run and populate delay values
+# sr_check_period is 1 second, so wait a bit longer to ensure it runs
+echo "Waiting for sr_check to run..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that delay values are populated in the log
+grep "executing replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command was not executed
+    echo "Log contents:"
+    tail -20 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+# Verify actual delay values were parsed
+if ! $PSQL -t -c "SHOW POOL_NODES" test | grep -E "[0-9]+\.[0-9]+" >/dev/null; then
+    echo "Warning: No delay values found in POOL_NODES output"
+fi
+
+# Check for delay log messages
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command delay logging not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: basic external command test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test2: External command with floating-point millisecond values ==="
+# ----------------------------------------------------------------------------------------
+# Update configuration to use float command
+sed -i.bak "s|delay_cmd_static.sh|delay_cmd_float.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with float values
+echo "Waiting for sr_check with float values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log 2>/dev/null; then
+        echo "Float command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that float values are handled correctly
+grep "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: float command was not executed
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: floating-point values test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test3: External command with delay threshold ==="
+# ----------------------------------------------------------------------------------------
+# Update configuration to use high delay command and set threshold
+sed -i.bak "s|delay_cmd_float.sh|delay_cmd_high.sh|" etc/pgpool.conf
+echo "delay_threshold_by_time = 1000" >> etc/pgpool.conf
+echo "backend_weight0 = 0" >> etc/pgpool.conf  # Force queries to standby normally
+echo "backend_weight2 = 0" >> etc/pgpool.conf  # Only use node 1 as standby
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and detect high delays
+echo "Waiting for sr_check with high delay values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_high.sh" log/pgpool.log 2>/dev/null; then
+        echo "High delay command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SELECT * FROM t1 LIMIT 1;
+EOF
+
+# With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
+# Log format can vary: either "statement: SELECT..." or "SELECT... DB node id:"
+if ! grep -E "DB node id: 0.*statement: SELECT \* FROM t1 LIMIT 1" log/pgpool.log >/dev/null 2>&1 && \
+   ! grep -E "SELECT \* FROM t1 LIMIT 1.*DB node id: 0" log/pgpool.log >/dev/null 2>&1; then
+    echo fail: query was not sent to primary node despite high delay
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: delay threshold test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test4: External command execution as process user ==="
+# ----------------------------------------------------------------------------------------
+# Test that command runs as the current pgpool process user
+sed -i.bak "s|delay_cmd_high.sh|delay_cmd_static.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for sr_check to run as process user..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed as process user after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check that command was executed (without su wrapper)
+grep "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command was not executed as process user
+    ./shutdownall
+    exit 1
+fi
+
+# Verify no su command was used
+if grep -q "executing replication delay command.*su.*" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not use su wrapper
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: process user execution test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test5: Error handling - missing command ==="
+# ----------------------------------------------------------------------------------------
+# Test error handling when command is not configured
+sed -i.bak "s|replication_delay_source_cmd = './delay_cmd_static.sh'|replication_delay_source_cmd = ''|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# With empty command, should fall back to builtin method
+# No specific error message expected - just verify it doesn't crash
+sleep 3
+
+echo "ok: empty command test succeeded (fallback to builtin)"
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test6: Error handling - command execution failure ==="
+# ----------------------------------------------------------------------------------------
+# Test error handling when command fails
+echo "replication_delay_source_cmd = './nonexistent_command.sh'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with failing command
+echo "Waiting for sr_check with failing command..."
+for i in {1..5}; do
+    if grep -qE "failed to (execute|read output from) replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command failure detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about command execution failure
+# Accept multiple possible error messages depending on shell behavior
+if ! grep -qE "failed to (execute|read output from) replication delay command" log/pgpool.log 2>/dev/null; then
+    echo fail: command execution failure not detected
+    echo "Log contents:"
+    tail -50 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command failure test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test7: Command timeout handling ==="
+# ----------------------------------------------------------------------------------------
+# Create a command that takes longer than the timeout
+cat > delay_cmd_slow.sh << 'EOF'
+#!/bin/bash
+# Slow command that takes 15 seconds (longer than default 10s timeout)
+sleep 15
+echo "25 50"
+EOF
+chmod +x delay_cmd_slow.sh
+
+# Set a short timeout and use the slow command
+sed -i.bak "s|replication_delay_source_cmd = './nonexistent_command.sh'|replication_delay_source_cmd = './delay_cmd_slow.sh'|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 3" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and timeout
+echo "Waiting for command timeout..."
+for i in {1..15}; do
+    if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+        echo "Command timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout error message
+grep "replication delay command timed out after 3 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test8: Handling of -1 for down nodes ==="
+# ----------------------------------------------------------------------------------------
+# Create a command that returns -1 for one replica
+cat > delay_cmd_with_down_node.sh << 'EOF'
+#!/bin/bash
+# Return -1 for first replica (indicating it's down), normal value for second
+echo "-1 50"
+EOF
+chmod +x delay_cmd_with_down_node.sh
+
+# Reset config
+rm -f etc/pgpool.conf.bak
+sed -i.bak "s|delay_cmd_slow.sh|delay_cmd_with_down_node.sh|" etc/pgpool.conf
+sed -i.bak "s|replication_delay_source_timeout = 3|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to process -1 value
+echo "Waiting for sr_check to process -1 value..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command.*delay -1" log/pgpool.log 2>/dev/null; then
+        echo "-1 handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for -1 logging message
+grep "node.*reported as down by external command.*delay -1.*relying on health check" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 handling message not found
+    ./shutdownall
+    exit 1
+fi
+
+# Verify that pgpool didn't crash or trigger failover just from -1
+if grep -q "failover" log/pgpool.log 2>/dev/null; then
+    echo "fail: -1 should not trigger immediate failover"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: -1 handling test succeeded
+./shutdownall
+
+echo "All external replication delay tests passed!"
+exit 0
diff --git a/src/test/regression/tests/041.external_replication_delay/test_parsing.sh b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
new file mode 100755
index 000000000..82fdad144
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+#-------------------------------------------------------------------
+# Unit test for external command parsing logic
+# This tests the parsing without needing a full pgpool setup
+#
+
+echo "=== Testing external command output parsing ==="
+
+# Test 1: Integer values
+echo "Test 1: Integer millisecond values"
+echo "0 25 50" > test_output.txt
+echo "Expected: 0ms, 25ms, 50ms"
+echo "Output: $(cat test_output.txt)"
+echo ""
+
+# Test 2: Float values
+echo "Test 2: Floating-point millisecond values"
+echo "0 25.5 100.75" > test_output_float.txt
+echo "Expected: 0ms, 25.5ms, 100.75ms"
+echo "Output: $(cat test_output_float.txt)"
+echo ""
+
+# Test 3: High precision float values
+echo "Test 3: High precision values"
+echo "0 0.001 999.999" > test_output_precision.txt
+echo "Expected: 0ms, 0.001ms, 999.999ms"
+echo "Output: $(cat test_output_precision.txt)"
+echo ""
+
+# Test 4: Edge case - zero values
+echo "Test 4: All zero values"
+echo "0 0 0" > test_output_zeros.txt
+echo "Expected: 0ms, 0ms, 0ms"
+echo "Output: $(cat test_output_zeros.txt)"
+echo ""
+
+# Test 5: Edge case - large values
+echo "Test 5: Large delay values"
+echo "0 5000 10000" > test_output_large.txt
+echo "Expected: 0ms, 5000ms, 10000ms"
+echo "Output: $(cat test_output_large.txt)"
+echo ""
+
+# Test 6: Mixed integer and float values
+echo "Test 6: Mixed integer and float values"
+echo "0 25 50.5" > test_output_mixed.txt
+echo "Expected: 0ms, 25ms, 50.5ms"
+echo "Output: $(cat test_output_mixed.txt)"
+echo ""
+
+# Cleanup
+rm -f test_output_*.txt
+
+echo "All parsing tests completed. These outputs should be parseable by the external command feature."
diff --git a/src/test/regression/tests/041.external_replication_delay/test_validation.sh b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
new file mode 100755
index 000000000..2cd4a7f0b
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
@@ -0,0 +1,323 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command validation and edge cases
+#
+source $TESTLIBS
+TESTDIR=testdir_validation
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create test command scripts
+# NOTE: All commands output values for REPLICAS only (primary omitted)
+cat > delay_cmd_validation.sh << 'EOF'
+#!/bin/bash
+# Test validation: output with invalid values for 2 replicas
+echo "invalid_value 50.5"
+EOF
+chmod +x delay_cmd_validation.sh
+
+cat > delay_cmd_negative.sh << 'EOF'
+#!/bin/bash
+# Test negative values (other than -1)
+echo "-25 50"
+EOF
+chmod +x delay_cmd_negative.sh
+
+cat > delay_cmd_large.sh << 'EOF'
+#!/bin/bash
+# Test extremely large values
+echo "9999999 50"
+EOF
+chmod +x delay_cmd_large.sh
+
+cat > delay_cmd_wrong_count.sh << 'EOF'
+#!/bin/bash
+# Test wrong number of values (only 1 instead of 2 for 2 replicas)
+echo "25"
+EOF
+chmod +x delay_cmd_wrong_count.sh
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test1: Validation of invalid delay values ==="
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_validation.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+# Reduce memory requirements for macOS shared memory limits
+echo "num_init_children = 4" >> etc/pgpool.conf
+echo "max_pool = 2" >> etc/pgpool.conf
+# Disable query caching to avoid shared memory issues on macOS
+echo "memory_cache_enabled = off" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run
+echo "Waiting for validation test..."
+for i in {1..10}; do
+    if grep -q "invalid delay value" log/pgpool.log 2>/dev/null; then
+        echo "Validation error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for validation warning
+grep "invalid delay value 'invalid_value' for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: validation warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: invalid value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test2: Negative delay values (other than -1) ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_validation.sh|delay_cmd_negative.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for negative value test..."
+for i in {1..10}; do
+    if grep -q "negative delay value.*other than -1" log/pgpool.log 2>/dev/null; then
+        echo "Negative value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for negative value warning
+grep "negative delay value.*other than -1.*treating as 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: negative value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: negative value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test3: Extremely large delay values ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_negative.sh|delay_cmd_large.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for large value test..."
+for i in {1..10}; do
+    if grep -q "extremely large delay value" log/pgpool.log 2>/dev/null; then
+        echo "Large value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for large value warning
+grep "extremely large delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: large value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: large value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test4: Wrong number of output values ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_large.sh|delay_cmd_wrong_count.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for wrong count test..."
+for i in {1..10}; do
+    if grep -q "returned.*values, expected.*replica" log/pgpool.log 2>/dev/null; then
+        echo "Wrong count warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for wrong count warning
+grep "returned.*values, expected.*replica.*Command should output one delay value per replica" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: wrong count validation test not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: wrong count validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test5: Multiple -1 values ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_multi_down.sh << 'EOF'
+#!/bin/bash
+# Test multiple replicas down
+echo "-1 -1"
+EOF
+chmod +x delay_cmd_multi_down.sh
+
+sed -i.bak "s|delay_cmd_wrong_count.sh|delay_cmd_multi_down.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for multi-down test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Multiple down nodes detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for multiple -1 handling
+DOWN_COUNT=$(grep -c "node.*reported as down by external command.*delay -1" log/pgpool.log)
+if [ "$DOWN_COUNT" -lt 2 ]; then
+    echo fail: expected 2 down node messages, found $DOWN_COUNT
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: multiple -1 handling test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test6: Command timeout with different timeout values ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_timeout.sh << 'EOF'
+#!/bin/bash
+# Command that takes 5 seconds
+sleep 5
+echo "25 50"
+EOF
+chmod +x delay_cmd_timeout.sh
+
+# Test with timeout shorter than command duration
+sed -i.bak "s|delay_cmd_multi_down.sh|delay_cmd_timeout.sh|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 2" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for timeout
+echo "Waiting for timeout test (2s timeout, 5s command)..."
+for i in {1..10}; do
+    if grep -q "replication delay command timed out after 2 seconds" log/pgpool.log 2>/dev/null; then
+        echo "Timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout message
+grep "replication delay command timed out after 2 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: timeout test succeeded
+./shutdownall
+
+# Test with timeout longer than command duration
+sed -i.bak "s|replication_delay_source_timeout = 2|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for successful execution
+echo "Waiting for successful execution (10s timeout, 5s command)..."
+for i in {1..15}; do
+    if grep -q "executing replication delay command.*delay_cmd_timeout.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed successfully after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should not timeout this time
+if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not have timed out with 10s timeout
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: extended timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test7: Mix of valid delays and -1 ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_mixed.sh << 'EOF'
+#!/bin/bash
+# One replica up (25ms), one down (-1)
+echo "25 -1"
+EOF
+chmod +x delay_cmd_mixed.sh
+
+sed -i.bak "s|delay_cmd_timeout.sh|delay_cmd_mixed.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check
+echo "Waiting for mixed delay test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Mixed delay handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should log one -1 and process one normal delay
+grep "node.*reported as down by external command.*delay -1" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 not logged
+    ./shutdownall
+    exit 1
+fi
+
+# Should also log the normal replica delay
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo "Note: Normal replica delay logging may not be visible with log_standby_delay settings"
+fi
+
+echo ok: mixed delay handling test succeeded
+./shutdownall
+
+echo "All validation tests passed!"
+exit 0
\ No newline at end of file
-- 
2.52.0



^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-11-24 07:41  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-11-24 07:41 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Thank you for updating the patch! This time the patch applies without
any issue and compiles fine. Unfortunately regression test failed.

testing 041.external_replication_delay...failed.

From the regression log, it seems Test7 failed.
------------------------------------------------------------------------------
=== Test7: Command timeout handling ===
waiting for server to start....411181 2025-11-24 16:31:05.244 JST LOG:  redirecting log output to logging collector process
411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will appear in directory "log".
 done
server started
waiting for server to start....411196 2025-11-24 16:31:05.352 JST LOG:  redirecting log output to logging collector process
411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will appear in directory "log".
 done
server started
waiting for server to start....411213 2025-11-24 16:31:05.461 JST LOG:  redirecting log output to logging collector process
411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will appear in directory "log".
 done
server started
Waiting for command timeout...
fail: command timeout not detected
------------------------------------------------------------------------------

Attached is the pgpool.log. If you need more info, please let me know.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


> Hi Tatsuo,
> 
> Sorry again, this was due to the separation of 2 patches and i only sent
> the one.
> 
> I've merged it into 1 commit and 1 patch and rebased over master to avoid
> these issues moving forward.
> 
> PFA latest version
> 
> On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> Hi Nadav,
>>
>> Thank you for new patch.
>> Unfortunately the patch did not apply to current master.
>>
>> $ git apply
>> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
>> error: patch failed: src/streaming_replication/pool_worker_child.c:694
>> error: src/streaming_replication/pool_worker_child.c: patch does not apply
>>
>> Maybe the patch is on top of your previous patch?
>>
>> Also I suggest to use "-v" option of "git format-patch" to add the
>> patch version number so that we can easily know which patch is the
>> latest.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> > Hi Tatsuo,
>> >
>> > Please see attached an updated version.
>> >
>> > thank you
>> >
>> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> > Sorry for that - thanks for the patch.
>> >> >
>> >> > Please find attached a new version
>> >>
>> >> Thanks for the new version. Unfortunately this time regression test
>> >> fails at:
>> >>
>> >> > Waiting for command timeout...
>> >> > fail: command timeout not detected
>> >>
>> >> Attached is the pgpool.log.
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <[email protected]>
>> >> wrote:
>> >> >
>> >> >> > thanks and sorry for the issues, please find attached updated
>> version.
>> >> >>
>> >> >> No problem.
>> >> >>
>> >> >> This time the patch applies fine, no compiler warnings.  However,
>> >> >> regression test did not passed here (on Ubuntu 24 LTS if this
>> >> >> matters).  So I looked into
>> >> >> src/test/regression/tests/041.external_replication_delay/test.sh a
>> >> >> little bit and apply attached patch (test.sh.patch). It moved forward
>> >> >> partially but failed at:
>> >> >>
>> >> >> fail: command execution failure not detected
>> >> >>
>> >> >> Please find attached
>> >> >>
>> >>
>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>> >> >> and src/test/regression/log/041.external_replication_delay.
>> >> >>
>> >> >> Best regards,
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS K.K.
>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> Japanese:http://www.sraoss.co.jp
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Nadav Shatz
>> >> > Tailor Brands | CTO
>> >>
>> >
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO

2025-11-24 16:31:05.537: main pid 411227: DEBUG:  initializing pool configuration
2025-11-24 16:31:05.537: main pid 411227: DETAIL:  num_backends: 3 total_weight: 1.000000
2025-11-24 16:31:05.537: main pid 411227: DEBUG:  initializing pool configuration
2025-11-24 16:31:05.537: main pid 411227: DETAIL:  backend 0 weight: 0.000000 flag: 0000
2025-11-24 16:31:05.537: main pid 411227: DEBUG:  initializing pool configuration
2025-11-24 16:31:05.537: main pid 411227: DETAIL:  backend 1 weight: 2147483647.000000 flag: 0000
2025-11-24 16:31:05.537: main pid 411227: DEBUG:  initializing pool configuration
2025-11-24 16:31:05.537: main pid 411227: DETAIL:  backend 2 weight: 0.000000 flag: 0000
2025-11-24 16:31:05.543: main pid 411227: LOG:  Backend status file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/pgpool_status discarded
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  BackendDesc: 113672 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  pool_coninfo_size: num_init_children (4) * max_pool (2) * MAX_NUM_BACKENDS (128) * sizeof(ConnectionInfo) (416) = 425984 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  ProcessInfo: num_init_children (4) * sizeof(ProcessInfo) (2152) = 8608 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  UserSignalSlot: 24 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  POOL_REQUEST_INFO: 5272 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  stat_shared_memory_size: 9216 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  SI_ManageInfo: 24 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  memcache blocks :64
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  shared_memory_cache_size: 67108864
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  shared_memory_fsmm_size: 64
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  pool_hash_size: 67108880
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  POOL_QUERY_CACHE_STATS: 24 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: LOG:  allocating shared memory segment of size: 134793216 
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  pool_coninfo_size: num_init_children (4) * max_pool (2) * MAX_NUM_BACKENDS (128) * sizeof(ConnectionInfo) (416) = 425984 bytes requested for shared memory
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  memcache blocks :64
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  shared_memory_cache_size: 67108864
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  memory cache request size : 67108864
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  shared_memory_fsmm_size: 64
2025-11-24 16:31:05.671: main pid 411227: LOG:  pool_discard_oid_maps: discarded memqcache oid maps
2025-11-24 16:31:05.686: main pid 411227: LOG:  create socket files[0]: /tmp/.s.PGSQL.11000
2025-11-24 16:31:05.686: main pid 411227: LOG:  listen address[0]: *
2025-11-24 16:31:05.687: main pid 411227: LOG:  Setting up socket for 0.0.0.0:11000
2025-11-24 16:31:05.687: main pid 411227: LOG:  Setting up socket for :::11000
2025-11-24 16:31:05.687: main pid 411227: DEBUG:  Spawning 4 child processes
2025-11-24 16:31:05.688: child pid 411234: DEBUG:  initializing backend status
2025-11-24 16:31:05.688: child pid 411235: DEBUG:  initializing backend status
2025-11-24 16:31:05.688: child pid 411236: DEBUG:  initializing backend status
2025-11-24 16:31:05.688: main pid 411227: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2025-11-24 16:31:05.689: child pid 411237: DEBUG:  initializing backend status
2025-11-24 16:31:05.696: main pid 411227: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.696: main pid 411227: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.696: main pid 411227: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.705: main pid 411227: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.705: main pid 411227: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.705: main pid 411227: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  pool_acquire_follow_primary_lock: lock was not held by anyone
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  pool_acquire_follow_primary_lock: succeeded in acquiring lock
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:05.715: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-24 16:31:05.716: main pid 411227: DEBUG:  get_server_version: backend 0 server version: 180000
2025-11-24 16:31:05.716: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:05.716: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  get_server_version: backend 1 server version: 180000
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  get_server_version: backend 2 server version: 180000
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  verify_backend_node_status: multiple standbys: 2
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  verify_backend_node_status: detach_false_primary is off and no additional checking is performed
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  pool_release_follow_primary_lock called
2025-11-24 16:31:05.718: main pid 411227: LOG:  find_primary_node: primary node is 0
2025-11-24 16:31:05.718: main pid 411227: LOG:  find_primary_node: standby node is 1
2025-11-24 16:31:05.718: main pid 411227: LOG:  find_primary_node: standby node is 2
2025-11-24 16:31:05.718: main pid 411227: LOG:  create socket files[0]: /tmp/.s.PGSQL.11001
2025-11-24 16:31:05.718: main pid 411227: LOG:  listen address[0]: localhost
2025-11-24 16:31:05.718: main pid 411227: LOG:  Setting up socket for 127.0.0.1:11001
2025-11-24 16:31:05.718: pcp_main pid 411241: DEBUG:  I am PCP child with pid:411241
2025-11-24 16:31:05.719: pcp_main pid 411241: LOG:  PCP process: 411241 started
2025-11-24 16:31:05.719: sr_check_worker pid 411242: LOG:  process started
2025-11-24 16:31:05.719: sr_check_worker pid 411242: DEBUG:  I am 411242
2025-11-24 16:31:05.719: sr_check_worker pid 411242: DEBUG:  pool_acquire_follow_primary_lock: lock was not held by anyone
2025-11-24 16:31:05.719: sr_check_worker pid 411242: DEBUG:  pool_acquire_follow_primary_lock: succeeded in acquiring lock
2025-11-24 16:31:05.719: health_check pid 411243: LOG:  process started
2025-11-24 16:31:05.719: health_check0 pid 411243: DEBUG:  I am health check process pid:411243 DB node id:0
2025-11-24 16:31:05.719: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.719: health_check pid 411244: LOG:  process started
2025-11-24 16:31:05.719: health_check1 pid 411244: DEBUG:  I am health check process pid:411244 DB node id:1
2025-11-24 16:31:05.719: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.719: health_check pid 411245: LOG:  process started
2025-11-24 16:31:05.720: health_check2 pid 411245: DEBUG:  I am health check process pid:411245 DB node id:2
2025-11-24 16:31:05.720: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.723: sr_check_worker pid 411242: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.723: health_check1 pid 411244: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.723: sr_check_worker pid 411242: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.723: sr_check_worker pid 411242: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.723: health_check1 pid 411244: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.723: health_check1 pid 411244: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.723: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.723: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.723: health_check1 pid 411244: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:05.723: health_check1 pid 411244: DETAIL:  No such file or directory
2025-11-24 16:31:05.723: health_check2 pid 411245: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.723: health_check2 pid 411245: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.723: health_check2 pid 411245: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.723: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.723: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.723: health_check2 pid 411245: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:05.723: health_check2 pid 411245: DETAIL:  No such file or directory
2025-11-24 16:31:05.724: health_check0 pid 411243: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.724: health_check0 pid 411243: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.724: health_check0 pid 411243: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.724: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.724: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.724: health_check0 pid 411243: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:05.724: health_check0 pid 411243: DETAIL:  No such file or directory
2025-11-24 16:31:05.727: main pid 411227: LOG:  pgpool-II successfully started. version 4.8devel (mitsukakeboshi)
2025-11-24 16:31:05.727: main pid 411227: LOG:  node status[0]: 1
2025-11-24 16:31:05.727: main pid 411227: LOG:  node status[1]: 2
2025-11-24 16:31:05.727: main pid 411227: LOG:  node status[2]: 2
2025-11-24 16:31:05.728: sr_check_worker pid 411242: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.728: sr_check_worker pid 411242: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.728: sr_check_worker pid 411242: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.732: sr_check_worker pid 411242: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.732: sr_check_worker pid 411242: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.732: sr_check_worker pid 411242: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.732: sr_check_worker pid 411242: DEBUG:  executing replication delay command: ./delay_cmd_slow.sh localhost:11003 localhost:11004
2025-11-24 16:31:05.732: sr_check_worker pid 411242: CONTEXT:  while checking replication time lag
2025-11-24 16:31:06.548: child pid 411237: DEBUG:  I am 411237 accept fd 7
2025-11-24 16:31:06.549: child pid 411237: DEBUG:  reading startup packet
2025-11-24 16:31:06.549: child pid 411237: DETAIL:  application_name: psql
2025-11-24 16:31:06.549: child pid 411237: DEBUG:  reading startup packet
2025-11-24 16:31:06.549: child pid 411237: DETAIL:  Protocol Major: 3 Minor: 0 database: test user: t-ishii
2025-11-24 16:31:06.549: child pid 411237: DEBUG:  creating new connection to backend
2025-11-24 16:31:06.549: child pid 411237: DETAIL:  connecting 0 backend
2025-11-24 16:31:06.549: child pid 411237: DEBUG:  creating new connection to backend
2025-11-24 16:31:06.549: child pid 411237: DETAIL:  connecting 1 backend
2025-11-24 16:31:06.549: child pid 411237: DEBUG:  creating new connection to backend
2025-11-24 16:31:06.549: child pid 411237: DETAIL:  connecting 2 backend
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  authentication backend
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  auth kind:0
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  reading message length
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  message length (22) in slot 1 does not match with slot 0(23)
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  reading message length
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  message length (22) in slot 2 does not match with slot 0(23)
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"in_hot_standby" value:"off"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"in_hot_standby" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"in_hot_standby" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"integer_datetimes" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"integer_datetimes" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"integer_datetimes" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"TimeZone" value:"Asia/Tokyo"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"TimeZone" value:"Asia/Tokyo"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"TimeZone" value:"Asia/Tokyo"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"IntervalStyle" value:"postgres"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"IntervalStyle" value:"postgres"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"IntervalStyle" value:"postgres"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"search_path" value:""$user", public"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"search_path" value:""$user", public"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"search_path" value:""$user", public"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"is_superuser" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"is_superuser" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"is_superuser" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"application_name" value:"psql"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"application_name" value:"psql"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"application_name" value:"psql"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"default_transaction_read_only" value:"off"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"default_transaction_read_only" value:"off"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"default_transaction_read_only" value:"off"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"scram_iterations" value:"4096"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"scram_iterations" value:"4096"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"scram_iterations" value:"4096"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"DateStyle" value:"ISO, MDY"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"DateStyle" value:"ISO, MDY"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"DateStyle" value:"ISO, MDY"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"standard_conforming_strings" value:"on"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"standard_conforming_strings" value:"on"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"standard_conforming_strings" value:"on"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"session_authorization" value:"t-ishii"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"session_authorization" value:"t-ishii"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"session_authorization" value:"t-ishii"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"client_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"client_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"client_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"server_version" value:"18.0"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"server_version" value:"18.0"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"server_version" value:"18.0"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"server_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"server_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"server_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  cancel key length: 4
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  authentication backend
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  cp->info[i]:0x7e3902ddfc08 pid:411256
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  authentication backend
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  cp->info[i]:0x7e3902ddfda8 pid:411257
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  authentication backend
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  cp->info[i]:0x7e3902ddff48 pid:411258
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  sending backend key data
2025-11-24 16:31:06.570: psql pid 411237: DEBUG:  selecting load balance node
2025-11-24 16:31:06.570: psql pid 411237: DETAIL:  selected backend id is 1
2025-11-24 16:31:06.571: psql pid 411237: LOG:  DB node id: 0 backend pid: 411256 statement: SELECT pg_catalog.version()
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  username: "t-ishii" database_name: "test"
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  query: "SELECT pg_catalog.version()"
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  `t-ishiiSELECT pg_catalog.version()test' -> `c8645f9bdf015b6b5ee4667cb578f1b3'
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  fetching from cache storage
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  search key "c8645f9bdf015b6b5ee4667cb578f1b3"
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  fetching from cache storage
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  cache not found on shared memory
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  not hit local relation cache and query cache
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  query:SELECT pg_catalog.version()
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.version()"
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.572: psql pid 411237: DEBUG:  committing relation cache to cache storage
2025-11-24 16:31:06.572: psql pid 411237: DETAIL:  Query="SELECT pg_catalog.version()"
2025-11-24 16:31:06.572: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.572: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.572: psql pid 411237: DETAIL:  username: "t-ishii" database_name: "test"
2025-11-24 16:31:06.572: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  query: "SELECT pg_catalog.version()"
2025-11-24 16:31:06.573: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  `t-ishiiSELECT pg_catalog.version()test' -> `c8645f9bdf015b6b5ee4667cb578f1b3'
2025-11-24 16:31:06.573: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  committing relation cache to cache storage
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  memqcache_expire = 0
2025-11-24 16:31:06.573: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  memcache adding item
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  new item inserted. blockid: 0 itemid:0
2025-11-24 16:31:06.573: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  memcache adding item
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  block: 0 item: 0
2025-11-24 16:31:06.573: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  SimpleQuery
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  nodes reporting
2025-11-24 16:31:06.582: psql pid 411237: DEBUG:  authenticate kind = 0
2025-11-24 16:31:06.582: psql pid 411237: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:06.582: psql pid 411237: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:06.582: psql pid 411237: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:06.591: psql pid 411237: DEBUG:  authenticate kind = 0
2025-11-24 16:31:06.591: psql pid 411237: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:06.591: psql pid 411237: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:06.591: psql pid 411237: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:06.598: psql pid 411237: DEBUG:  authenticate kind = 0
2025-11-24 16:31:06.598: psql pid 411237: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:06.598: psql pid 411237: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:06.598: psql pid 411237: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:06.599: psql pid 411237: DEBUG:  decide where to send the query
2025-11-24 16:31:06.599: psql pid 411237: DETAIL:  destination = 3 for query= "DISCARD ALL"
2025-11-24 16:31:06.599: psql pid 411237: LOG:  DB node id: 0 backend pid: 411256 statement: DISCARD ALL
2025-11-24 16:31:06.599: psql pid 411237: DEBUG:  waiting for query response
2025-11-24 16:31:06.599: psql pid 411237: DETAIL:  waiting for backend:0 to complete the query
2025-11-24 16:31:06.600: psql pid 411237: LOG:  DB node id: 1 backend pid: 411257 statement: DISCARD ALL
2025-11-24 16:31:06.600: psql pid 411237: DEBUG:  waiting for query response
2025-11-24 16:31:06.600: psql pid 411237: DETAIL:  waiting for backend:1 to complete the query
2025-11-24 16:31:06.600: psql pid 411237: DEBUG:  setting backend connection close timer
2025-11-24 16:31:06.600: psql pid 411237: DETAIL:  close time 1763969466
2025-11-24 16:31:15.724: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.724: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.724: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check2 pid 411245: DEBUG:  authenticate kind = 0
2025-11-24 16:31:15.729: health_check2 pid 411245: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:15.729: health_check2 pid 411245: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:15.729: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check2 pid 411245: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:15.729: health_check2 pid 411245: DETAIL:  No such file or directory
2025-11-24 16:31:15.729: health_check0 pid 411243: DEBUG:  authenticate kind = 0
2025-11-24 16:31:15.729: health_check0 pid 411243: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:15.729: health_check0 pid 411243: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:15.729: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check1 pid 411244: DEBUG:  authenticate kind = 0
2025-11-24 16:31:15.729: health_check1 pid 411244: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:15.729: health_check1 pid 411244: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:15.729: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check0 pid 411243: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:15.729: health_check0 pid 411243: DETAIL:  No such file or directory
2025-11-24 16:31:15.729: health_check1 pid 411244: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:15.729: health_check1 pid 411244: DETAIL:  No such file or directory
2025-11-24 16:31:20.738: sr_check_worker pid 411242: LOG:  Replication of node: 1 is behind 0.025 second(s) from the primary server (node: 0) [external command]
2025-11-24 16:31:20.738: sr_check_worker pid 411242: CONTEXT:  while checking replication time lag
2025-11-24 16:31:20.738: sr_check_worker pid 411242: LOG:  Replication of node: 2 is behind 0.050 second(s) from the primary server (node: 0) [external command]
2025-11-24 16:31:20.738: sr_check_worker pid 411242: CONTEXT:  while checking replication time lag
2025-11-24 16:31:20.738: sr_check_worker pid 411242: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:20.740: sr_check_worker pid 411242: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:20.741: sr_check_worker pid 411242: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  verify_backend_node_status: multiple standbys: 2
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  verify_backend_node_status: detach_false_primary is off and no additional checking is performed
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  node status[0]: 1
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  node status[1]: 2
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  node status[2]: 2
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  pool_release_follow_primary_lock called
2025-11-24 16:31:21.696: main pid 411227: LOG:  exit handler called (signal: 2)
2025-11-24 16:31:21.696: main pid 411227: LOG:  shutting down by signal 2
2025-11-24 16:31:21.696: main pid 411227: LOG:  terminating all child processes
2025-11-24 16:31:21.699: main pid 411227: LOG:  Pgpool-II system is shutdown


Attachments:

  [text/plain] pgpool.log (32.0K, 2-pgpool.log)
  download | inline:
2025-11-24 16:31:05.537: main pid 411227: DEBUG:  initializing pool configuration
2025-11-24 16:31:05.537: main pid 411227: DETAIL:  num_backends: 3 total_weight: 1.000000
2025-11-24 16:31:05.537: main pid 411227: DEBUG:  initializing pool configuration
2025-11-24 16:31:05.537: main pid 411227: DETAIL:  backend 0 weight: 0.000000 flag: 0000
2025-11-24 16:31:05.537: main pid 411227: DEBUG:  initializing pool configuration
2025-11-24 16:31:05.537: main pid 411227: DETAIL:  backend 1 weight: 2147483647.000000 flag: 0000
2025-11-24 16:31:05.537: main pid 411227: DEBUG:  initializing pool configuration
2025-11-24 16:31:05.537: main pid 411227: DETAIL:  backend 2 weight: 0.000000 flag: 0000
2025-11-24 16:31:05.543: main pid 411227: LOG:  Backend status file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/pgpool_status discarded
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  BackendDesc: 113672 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  pool_coninfo_size: num_init_children (4) * max_pool (2) * MAX_NUM_BACKENDS (128) * sizeof(ConnectionInfo) (416) = 425984 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  ProcessInfo: num_init_children (4) * sizeof(ProcessInfo) (2152) = 8608 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  UserSignalSlot: 24 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  POOL_REQUEST_INFO: 5272 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  stat_shared_memory_size: 9216 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  SI_ManageInfo: 24 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  memcache blocks :64
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  shared_memory_cache_size: 67108864
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  shared_memory_fsmm_size: 64
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  pool_hash_size: 67108880
2025-11-24 16:31:05.543: main pid 411227: DEBUG:  POOL_QUERY_CACHE_STATS: 24 bytes requested for shared memory
2025-11-24 16:31:05.543: main pid 411227: LOG:  allocating shared memory segment of size: 134793216 
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  pool_coninfo_size: num_init_children (4) * max_pool (2) * MAX_NUM_BACKENDS (128) * sizeof(ConnectionInfo) (416) = 425984 bytes requested for shared memory
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  health_check_stats_shared_memory_size: requested size: 12288
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  memcache blocks :64
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  shared_memory_cache_size: 67108864
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  memory cache request size : 67108864
2025-11-24 16:31:05.668: main pid 411227: DEBUG:  shared_memory_fsmm_size: 64
2025-11-24 16:31:05.671: main pid 411227: LOG:  pool_discard_oid_maps: discarded memqcache oid maps
2025-11-24 16:31:05.686: main pid 411227: LOG:  create socket files[0]: /tmp/.s.PGSQL.11000
2025-11-24 16:31:05.686: main pid 411227: LOG:  listen address[0]: *
2025-11-24 16:31:05.687: main pid 411227: LOG:  Setting up socket for 0.0.0.0:11000
2025-11-24 16:31:05.687: main pid 411227: LOG:  Setting up socket for :::11000
2025-11-24 16:31:05.687: main pid 411227: DEBUG:  Spawning 4 child processes
2025-11-24 16:31:05.688: child pid 411234: DEBUG:  initializing backend status
2025-11-24 16:31:05.688: child pid 411235: DEBUG:  initializing backend status
2025-11-24 16:31:05.688: child pid 411236: DEBUG:  initializing backend status
2025-11-24 16:31:05.688: main pid 411227: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2025-11-24 16:31:05.689: child pid 411237: DEBUG:  initializing backend status
2025-11-24 16:31:05.696: main pid 411227: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.696: main pid 411227: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.696: main pid 411227: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.705: main pid 411227: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.705: main pid 411227: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.705: main pid 411227: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  pool_acquire_follow_primary_lock: lock was not held by anyone
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  pool_acquire_follow_primary_lock: succeeded in acquiring lock
2025-11-24 16:31:05.714: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:05.715: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-24 16:31:05.716: main pid 411227: DEBUG:  get_server_version: backend 0 server version: 180000
2025-11-24 16:31:05.716: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:05.716: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  get_server_version: backend 1 server version: 180000
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.current_setting('server_version_num')"
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  get_server_version: backend 2 server version: 180000
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  verify_backend_node_status: multiple standbys: 2
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  verify_backend_node_status: detach_false_primary is off and no additional checking is performed
2025-11-24 16:31:05.717: main pid 411227: DEBUG:  pool_release_follow_primary_lock called
2025-11-24 16:31:05.718: main pid 411227: LOG:  find_primary_node: primary node is 0
2025-11-24 16:31:05.718: main pid 411227: LOG:  find_primary_node: standby node is 1
2025-11-24 16:31:05.718: main pid 411227: LOG:  find_primary_node: standby node is 2
2025-11-24 16:31:05.718: main pid 411227: LOG:  create socket files[0]: /tmp/.s.PGSQL.11001
2025-11-24 16:31:05.718: main pid 411227: LOG:  listen address[0]: localhost
2025-11-24 16:31:05.718: main pid 411227: LOG:  Setting up socket for 127.0.0.1:11001
2025-11-24 16:31:05.718: pcp_main pid 411241: DEBUG:  I am PCP child with pid:411241
2025-11-24 16:31:05.719: pcp_main pid 411241: LOG:  PCP process: 411241 started
2025-11-24 16:31:05.719: sr_check_worker pid 411242: LOG:  process started
2025-11-24 16:31:05.719: sr_check_worker pid 411242: DEBUG:  I am 411242
2025-11-24 16:31:05.719: sr_check_worker pid 411242: DEBUG:  pool_acquire_follow_primary_lock: lock was not held by anyone
2025-11-24 16:31:05.719: sr_check_worker pid 411242: DEBUG:  pool_acquire_follow_primary_lock: succeeded in acquiring lock
2025-11-24 16:31:05.719: health_check pid 411243: LOG:  process started
2025-11-24 16:31:05.719: health_check0 pid 411243: DEBUG:  I am health check process pid:411243 DB node id:0
2025-11-24 16:31:05.719: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.719: health_check pid 411244: LOG:  process started
2025-11-24 16:31:05.719: health_check1 pid 411244: DEBUG:  I am health check process pid:411244 DB node id:1
2025-11-24 16:31:05.719: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.719: health_check pid 411245: LOG:  process started
2025-11-24 16:31:05.720: health_check2 pid 411245: DEBUG:  I am health check process pid:411245 DB node id:2
2025-11-24 16:31:05.720: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.723: sr_check_worker pid 411242: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.723: health_check1 pid 411244: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.723: sr_check_worker pid 411242: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.723: sr_check_worker pid 411242: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.723: health_check1 pid 411244: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.723: health_check1 pid 411244: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.723: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.723: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.723: health_check1 pid 411244: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:05.723: health_check1 pid 411244: DETAIL:  No such file or directory
2025-11-24 16:31:05.723: health_check2 pid 411245: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.723: health_check2 pid 411245: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.723: health_check2 pid 411245: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.723: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.723: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.723: health_check2 pid 411245: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:05.723: health_check2 pid 411245: DETAIL:  No such file or directory
2025-11-24 16:31:05.724: health_check0 pid 411243: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.724: health_check0 pid 411243: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.724: health_check0 pid 411243: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.724: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.724: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:05.724: health_check0 pid 411243: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:05.724: health_check0 pid 411243: DETAIL:  No such file or directory
2025-11-24 16:31:05.727: main pid 411227: LOG:  pgpool-II successfully started. version 4.8devel (mitsukakeboshi)
2025-11-24 16:31:05.727: main pid 411227: LOG:  node status[0]: 1
2025-11-24 16:31:05.727: main pid 411227: LOG:  node status[1]: 2
2025-11-24 16:31:05.727: main pid 411227: LOG:  node status[2]: 2
2025-11-24 16:31:05.728: sr_check_worker pid 411242: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.728: sr_check_worker pid 411242: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.728: sr_check_worker pid 411242: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.732: sr_check_worker pid 411242: DEBUG:  authenticate kind = 0
2025-11-24 16:31:05.732: sr_check_worker pid 411242: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:05.732: sr_check_worker pid 411242: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:05.732: sr_check_worker pid 411242: DEBUG:  executing replication delay command: ./delay_cmd_slow.sh localhost:11003 localhost:11004
2025-11-24 16:31:05.732: sr_check_worker pid 411242: CONTEXT:  while checking replication time lag
2025-11-24 16:31:06.548: child pid 411237: DEBUG:  I am 411237 accept fd 7
2025-11-24 16:31:06.549: child pid 411237: DEBUG:  reading startup packet
2025-11-24 16:31:06.549: child pid 411237: DETAIL:  application_name: psql
2025-11-24 16:31:06.549: child pid 411237: DEBUG:  reading startup packet
2025-11-24 16:31:06.549: child pid 411237: DETAIL:  Protocol Major: 3 Minor: 0 database: test user: t-ishii
2025-11-24 16:31:06.549: child pid 411237: DEBUG:  creating new connection to backend
2025-11-24 16:31:06.549: child pid 411237: DETAIL:  connecting 0 backend
2025-11-24 16:31:06.549: child pid 411237: DEBUG:  creating new connection to backend
2025-11-24 16:31:06.549: child pid 411237: DETAIL:  connecting 1 backend
2025-11-24 16:31:06.549: child pid 411237: DEBUG:  creating new connection to backend
2025-11-24 16:31:06.549: child pid 411237: DETAIL:  connecting 2 backend
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  authentication backend
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  auth kind:0
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  reading message length
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  message length (22) in slot 1 does not match with slot 0(23)
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  reading message length
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  message length (22) in slot 2 does not match with slot 0(23)
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"in_hot_standby" value:"off"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"in_hot_standby" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"in_hot_standby" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"integer_datetimes" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"integer_datetimes" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"integer_datetimes" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"TimeZone" value:"Asia/Tokyo"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"TimeZone" value:"Asia/Tokyo"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"TimeZone" value:"Asia/Tokyo"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"IntervalStyle" value:"postgres"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"IntervalStyle" value:"postgres"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"IntervalStyle" value:"postgres"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"search_path" value:""$user", public"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"search_path" value:""$user", public"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"search_path" value:""$user", public"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"is_superuser" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:1 name:"is_superuser" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:2 name:"is_superuser" value:"on"
2025-11-24 16:31:06.567: child pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: child pid 411237: DETAIL:  backend:0 name:"application_name" value:"psql"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"application_name" value:"psql"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"application_name" value:"psql"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"default_transaction_read_only" value:"off"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"default_transaction_read_only" value:"off"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"default_transaction_read_only" value:"off"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"scram_iterations" value:"4096"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"scram_iterations" value:"4096"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"scram_iterations" value:"4096"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"DateStyle" value:"ISO, MDY"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"DateStyle" value:"ISO, MDY"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"DateStyle" value:"ISO, MDY"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"standard_conforming_strings" value:"on"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"standard_conforming_strings" value:"on"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"standard_conforming_strings" value:"on"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"session_authorization" value:"t-ishii"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"session_authorization" value:"t-ishii"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"session_authorization" value:"t-ishii"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"client_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"client_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"client_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"server_version" value:"18.0"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"server_version" value:"18.0"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"server_version" value:"18.0"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:0 name:"server_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:1 name:"server_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  process parameter status
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  backend:2 name:"server_encoding" value:"UTF8"
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  cancel key length: 4
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  authentication backend
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  cp->info[i]:0x7e3902ddfc08 pid:411256
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  authentication backend
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  cp->info[i]:0x7e3902ddfda8 pid:411257
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  authentication backend
2025-11-24 16:31:06.567: psql pid 411237: DETAIL:  cp->info[i]:0x7e3902ddff48 pid:411258
2025-11-24 16:31:06.567: psql pid 411237: DEBUG:  sending backend key data
2025-11-24 16:31:06.570: psql pid 411237: DEBUG:  selecting load balance node
2025-11-24 16:31:06.570: psql pid 411237: DETAIL:  selected backend id is 1
2025-11-24 16:31:06.571: psql pid 411237: LOG:  DB node id: 0 backend pid: 411256 statement: SELECT pg_catalog.version()
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  username: "t-ishii" database_name: "test"
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  query: "SELECT pg_catalog.version()"
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  `t-ishiiSELECT pg_catalog.version()test' -> `c8645f9bdf015b6b5ee4667cb578f1b3'
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  fetching from cache storage
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  search key "c8645f9bdf015b6b5ee4667cb578f1b3"
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  fetching from cache storage
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  cache not found on shared memory
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  not hit local relation cache and query cache
2025-11-24 16:31:06.571: psql pid 411237: DETAIL:  query:SELECT pg_catalog.version()
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.571: psql pid 411237: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.version()"
2025-11-24 16:31:06.571: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.572: psql pid 411237: DEBUG:  committing relation cache to cache storage
2025-11-24 16:31:06.572: psql pid 411237: DETAIL:  Query="SELECT pg_catalog.version()"
2025-11-24 16:31:06.572: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.572: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.572: psql pid 411237: DETAIL:  username: "t-ishii" database_name: "test"
2025-11-24 16:31:06.572: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  query: "SELECT pg_catalog.version()"
2025-11-24 16:31:06.573: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  memcache encode key
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  `t-ishiiSELECT pg_catalog.version()test' -> `c8645f9bdf015b6b5ee4667cb578f1b3'
2025-11-24 16:31:06.573: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  committing relation cache to cache storage
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  memqcache_expire = 0
2025-11-24 16:31:06.573: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  memcache adding item
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  new item inserted. blockid: 0 itemid:0
2025-11-24 16:31:06.573: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  memcache adding item
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  block: 0 item: 0
2025-11-24 16:31:06.573: psql pid 411237: CONTEXT:  while searching system catalog, When relcache is missed
2025-11-24 16:31:06.573: psql pid 411237: DEBUG:  SimpleQuery
2025-11-24 16:31:06.573: psql pid 411237: DETAIL:  nodes reporting
2025-11-24 16:31:06.582: psql pid 411237: DEBUG:  authenticate kind = 0
2025-11-24 16:31:06.582: psql pid 411237: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:06.582: psql pid 411237: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:06.582: psql pid 411237: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:06.591: psql pid 411237: DEBUG:  authenticate kind = 0
2025-11-24 16:31:06.591: psql pid 411237: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:06.591: psql pid 411237: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:06.591: psql pid 411237: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:06.598: psql pid 411237: DEBUG:  authenticate kind = 0
2025-11-24 16:31:06.598: psql pid 411237: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:06.598: psql pid 411237: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:06.598: psql pid 411237: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:06.599: psql pid 411237: DEBUG:  decide where to send the query
2025-11-24 16:31:06.599: psql pid 411237: DETAIL:  destination = 3 for query= "DISCARD ALL"
2025-11-24 16:31:06.599: psql pid 411237: LOG:  DB node id: 0 backend pid: 411256 statement: DISCARD ALL
2025-11-24 16:31:06.599: psql pid 411237: DEBUG:  waiting for query response
2025-11-24 16:31:06.599: psql pid 411237: DETAIL:  waiting for backend:0 to complete the query
2025-11-24 16:31:06.600: psql pid 411237: LOG:  DB node id: 1 backend pid: 411257 statement: DISCARD ALL
2025-11-24 16:31:06.600: psql pid 411237: DEBUG:  waiting for query response
2025-11-24 16:31:06.600: psql pid 411237: DETAIL:  waiting for backend:1 to complete the query
2025-11-24 16:31:06.600: psql pid 411237: DEBUG:  setting backend connection close timer
2025-11-24 16:31:06.600: psql pid 411237: DETAIL:  close time 1763969466
2025-11-24 16:31:15.724: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.724: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.724: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check2 pid 411245: DEBUG:  authenticate kind = 0
2025-11-24 16:31:15.729: health_check2 pid 411245: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:15.729: health_check2 pid 411245: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:15.729: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check2 pid 411245: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check2 pid 411245: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:15.729: health_check2 pid 411245: DETAIL:  No such file or directory
2025-11-24 16:31:15.729: health_check0 pid 411243: DEBUG:  authenticate kind = 0
2025-11-24 16:31:15.729: health_check0 pid 411243: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:15.729: health_check0 pid 411243: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:15.729: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check0 pid 411243: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check1 pid 411244: DEBUG:  authenticate kind = 0
2025-11-24 16:31:15.729: health_check1 pid 411244: DEBUG:  authenticate backend: key data received
2025-11-24 16:31:15.729: health_check1 pid 411244: DEBUG:  authenticate backend: transaction state: I
2025-11-24 16:31:15.729: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check1 pid 411244: DEBUG:  health check: clearing alarm
2025-11-24 16:31:15.729: health_check0 pid 411243: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:15.729: health_check0 pid 411243: DETAIL:  No such file or directory
2025-11-24 16:31:15.729: health_check1 pid 411244: WARNING:  check_backend_down_request: failed to open file /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/041.external_replication_delay/testdir/log/backend_down_request
2025-11-24 16:31:15.729: health_check1 pid 411244: DETAIL:  No such file or directory
2025-11-24 16:31:20.738: sr_check_worker pid 411242: LOG:  Replication of node: 1 is behind 0.025 second(s) from the primary server (node: 0) [external command]
2025-11-24 16:31:20.738: sr_check_worker pid 411242: CONTEXT:  while checking replication time lag
2025-11-24 16:31:20.738: sr_check_worker pid 411242: LOG:  Replication of node: 2 is behind 0.050 second(s) from the primary server (node: 0) [external command]
2025-11-24 16:31:20.738: sr_check_worker pid 411242: CONTEXT:  while checking replication time lag
2025-11-24 16:31:20.738: sr_check_worker pid 411242: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:20.740: sr_check_worker pid 411242: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:20.741: sr_check_worker pid 411242: DEBUG:  do_query: extended:0 query:"SELECT pg_catalog.pg_is_in_recovery()"
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  verify_backend_node_status: multiple standbys: 2
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  verify_backend_node_status: detach_false_primary is off and no additional checking is performed
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  node status[0]: 1
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  node status[1]: 2
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  node status[2]: 2
2025-11-24 16:31:20.742: sr_check_worker pid 411242: DEBUG:  pool_release_follow_primary_lock called
2025-11-24 16:31:21.696: main pid 411227: LOG:  exit handler called (signal: 2)
2025-11-24 16:31:21.696: main pid 411227: LOG:  shutting down by signal 2
2025-11-24 16:31:21.696: main pid 411227: LOG:  terminating all child processes
2025-11-24 16:31:21.699: main pid 411227: LOG:  Pgpool-II system is shutdown

^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-21 11:06  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-12-21 11:06 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

--00000000000005ddad0646744dfd
Content-Type: multipart/alternative; boundary="00000000000005ddac0646744dfb"

--00000000000005ddac0646744dfb
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

I think everything is passing now. new version attached.

On Mon, Nov 24, 2025 at 9:41=E2=80=AFAM Tatsuo Ishii <[email protected]>=
 wrote:

> Thank you for updating the patch! This time the patch applies without
> any issue and compiles fine. Unfortunately regression test failed.
>
> testing 041.external_replication_delay...failed.
>
> From the regression log, it seems Test7 failed.
>
> -------------------------------------------------------------------------=
-----
> =3D=3D=3D Test7: Command timeout handling =3D=3D=3D
> waiting for server to start....411181 2025-11-24 16:31:05.244 JST LOG:
> redirecting log output to logging collector process
> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will appear i=
n
> directory "log".
>  done
> server started
> waiting for server to start....411196 2025-11-24 16:31:05.352 JST LOG:
> redirecting log output to logging collector process
> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will appear i=
n
> directory "log".
>  done
> server started
> waiting for server to start....411213 2025-11-24 16:31:05.461 JST LOG:
> redirecting log output to logging collector process
> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will appear i=
n
> directory "log".
>  done
> server started
> Waiting for command timeout...
> fail: command timeout not detected
>
> -------------------------------------------------------------------------=
-----
>
> Attached is the pgpool.log. If you need more info, please let me know.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
>
> > Hi Tatsuo,
> >
> > Sorry again, this was due to the separation of 2 patches and i only sen=
t
> > the one.
> >
> > I've merged it into 1 commit and 1 patch and rebased over master to avo=
id
> > these issues moving forward.
> >
> > PFA latest version
> >
> > On Thu, Nov 20, 2025 at 1:09=E2=80=AFAM Tatsuo Ishii <ishii@postgresql.=
org>
> wrote:
> >
> >> Hi Nadav,
> >>
> >> Thank you for new patch.
> >> Unfortunately the patch did not apply to current master.
> >>
> >> $ git apply
> >> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
> >> error: patch failed: src/streaming_replication/pool_worker_child.c:694
> >> error: src/streaming_replication/pool_worker_child.c: patch does not
> apply
> >>
> >> Maybe the patch is on top of your previous patch?
> >>
> >> Also I suggest to use "-v" option of "git format-patch" to add the
> >> patch version number so that we can easily know which patch is the
> >> latest.
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
> >> > Hi Tatsuo,
> >> >
> >> > Please see attached an updated version.
> >> >
> >> > thank you
> >> >
> >> > On Fri, Nov 7, 2025 at 2:07=E2=80=AFAM Tatsuo Ishii <ishii@postgresq=
l.org>
> >> wrote:
> >> >
> >> >> > Sorry for that - thanks for the patch.
> >> >> >
> >> >> > Please find attached a new version
> >> >>
> >> >> Thanks for the new version. Unfortunately this time regression test
> >> >> fails at:
> >> >>
> >> >> > Waiting for command timeout...
> >> >> > fail: command timeout not detected
> >> >>
> >> >> Attached is the pgpool.log.
> >> >>
> >> >> Best regards,
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS K.K.
> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> Japanese:http://www.sraoss.co.jp
> >> >>
> >> >> > On Mon, Nov 3, 2025 at 9:05=E2=80=AFAM Tatsuo Ishii <ishii@postgr=
esql.org>
> >> >> wrote:
> >> >> >
> >> >> >> > thanks and sorry for the issues, please find attached updated
> >> version.
> >> >> >>
> >> >> >> No problem.
> >> >> >>
> >> >> >> This time the patch applies fine, no compiler warnings.  However=
,
> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if this
> >> >> >> matters).  So I looked into
> >> >> >> src/test/regression/tests/041.external_replication_delay/test.sh=
 a
> >> >> >> little bit and apply attached patch (test.sh.patch). It moved
> forward
> >> >> >> partially but failed at:
> >> >> >>
> >> >> >> fail: command execution failure not detected
> >> >> >>
> >> >> >> Please find attached
> >> >> >>
> >> >>
> >>
> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.l=
og
> >> >> >> and src/test/regression/log/041.external_replication_delay.
> >> >> >>
> >> >> >> Best regards,
> >> >> >> --
> >> >> >> Tatsuo Ishii
> >> >> >> SRA OSS K.K.
> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> >> Japanese:http://www.sraoss.co.jp
> >> >> >>
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Nadav Shatz
> >> >> > Tailor Brands | CTO
> >> >>
> >> >
> >> >
> >> > --
> >> > Nadav Shatz
> >> > Tailor Brands | CTO
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>


--=20
Nadav Shatz
Tailor Brands | CTO

--00000000000005ddac0646744dfb
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">I think everything=C2=A0is passing now. new version attach=
ed.</div><br><div class=3D"gmail_quote gmail_quote_container"><div dir=3D"l=
tr" class=3D"gmail_attr">On Mon, Nov 24, 2025 at 9:41=E2=80=AFAM Tatsuo Ish=
ii &lt;<a href=3D"mailto:[email protected]">[email protected]</a>&gt;=
 wrote:<br></div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px =
0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Thank yo=
u for updating the patch! This time the patch applies without<br>
any issue and compiles fine. Unfortunately regression test failed.<br>
<br>
testing 041.external_replication_delay...failed.<br>
<br>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-23 00:13  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-12-23 00:13 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> I think everything is passing now. new version attached.

Unfortunately Test1 did not pass.

=== Test1: Basic external command with integer millisecond values ===
waiting for server to start....1438600 2025-12-23 09:09:48.337 JST LOG:  redirecting log output to logging collector process
1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will appear in directory "log".
 done
server started
waiting for server to start....1438617 2025-12-23 09:09:48.443 JST LOG:  redirecting log output to logging collector process
1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will appear in directory "log".
 done
server started
waiting for server to start....1438634 2025-12-23 09:09:48.561 JST LOG:  redirecting log output to logging collector process
1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will appear in directory "log".
 done
server started
CREATE TABLE
Waiting for sr_check to run...
Command executed after 1 seconds
 node_id | hostname  | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | localhost | 11002 | up     | up        | 0.333333  | primary | primary | 0          | true              | 0                 |                   |                        | 2025-12-23 09:09:49
 1       | localhost | 11003 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 |                   |                        | 2025-12-23 09:09:49
 2       | localhost | 11004 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 |                   |                        | 2025-12-23 09:09:49
(3 rows)

fail: external command delay logging not found

> On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> Thank you for updating the patch! This time the patch applies without
>> any issue and compiles fine. Unfortunately regression test failed.
>>
>> testing 041.external_replication_delay...failed.
>>
>> From the regression log, it seems Test7 failed.
>>
>> ------------------------------------------------------------------------------
>> === Test7: Command timeout handling ===
>> waiting for server to start....411181 2025-11-24 16:31:05.244 JST LOG:
>> redirecting log output to logging collector process
>> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will appear in
>> directory "log".
>>  done
>> server started
>> waiting for server to start....411196 2025-11-24 16:31:05.352 JST LOG:
>> redirecting log output to logging collector process
>> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will appear in
>> directory "log".
>>  done
>> server started
>> waiting for server to start....411213 2025-11-24 16:31:05.461 JST LOG:
>> redirecting log output to logging collector process
>> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will appear in
>> directory "log".
>>  done
>> server started
>> Waiting for command timeout...
>> fail: command timeout not detected
>>
>> ------------------------------------------------------------------------------
>>
>> Attached is the pgpool.log. If you need more info, please let me know.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>>
>> > Hi Tatsuo,
>> >
>> > Sorry again, this was due to the separation of 2 patches and i only sent
>> > the one.
>> >
>> > I've merged it into 1 commit and 1 patch and rebased over master to avoid
>> > these issues moving forward.
>> >
>> > PFA latest version
>> >
>> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> Hi Nadav,
>> >>
>> >> Thank you for new patch.
>> >> Unfortunately the patch did not apply to current master.
>> >>
>> >> $ git apply
>> >> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
>> >> error: patch failed: src/streaming_replication/pool_worker_child.c:694
>> >> error: src/streaming_replication/pool_worker_child.c: patch does not
>> apply
>> >>
>> >> Maybe the patch is on top of your previous patch?
>> >>
>> >> Also I suggest to use "-v" option of "git format-patch" to add the
>> >> patch version number so that we can easily know which patch is the
>> >> latest.
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >> > Hi Tatsuo,
>> >> >
>> >> > Please see attached an updated version.
>> >> >
>> >> > thank you
>> >> >
>> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <[email protected]>
>> >> wrote:
>> >> >
>> >> >> > Sorry for that - thanks for the patch.
>> >> >> >
>> >> >> > Please find attached a new version
>> >> >>
>> >> >> Thanks for the new version. Unfortunately this time regression test
>> >> >> fails at:
>> >> >>
>> >> >> > Waiting for command timeout...
>> >> >> > fail: command timeout not detected
>> >> >>
>> >> >> Attached is the pgpool.log.
>> >> >>
>> >> >> Best regards,
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS K.K.
>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> Japanese:http://www.sraoss.co.jp
>> >> >>
>> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <[email protected]>
>> >> >> wrote:
>> >> >> >
>> >> >> >> > thanks and sorry for the issues, please find attached updated
>> >> version.
>> >> >> >>
>> >> >> >> No problem.
>> >> >> >>
>> >> >> >> This time the patch applies fine, no compiler warnings.  However,
>> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if this
>> >> >> >> matters).  So I looked into
>> >> >> >> src/test/regression/tests/041.external_replication_delay/test.sh a
>> >> >> >> little bit and apply attached patch (test.sh.patch). It moved
>> forward
>> >> >> >> partially but failed at:
>> >> >> >>
>> >> >> >> fail: command execution failure not detected
>> >> >> >>
>> >> >> >> Please find attached
>> >> >> >>
>> >> >>
>> >>
>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>> >> >> >> and src/test/regression/log/041.external_replication_delay.
>> >> >> >>
>> >> >> >> Best regards,
>> >> >> >> --
>> >> >> >> Tatsuo Ishii
>> >> >> >> SRA OSS K.K.
>> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Nadav Shatz
>> >> >> > Tailor Brands | CTO
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Nadav Shatz
>> >> > Tailor Brands | CTO
>> >>
>> >
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-23 06:28  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-12-23 06:28 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

I'km running into issues testing this and have created a full docker
compose setup - can you please point me to up to date guides on the best
way to run the tests so i know we're doing it the same way?

Thank you for all your help!

On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <[email protected]> wrote:

> > I think everything is passing now. new version attached.
>
> Unfortunately Test1 did not pass.
>
> === Test1: Basic external command with integer millisecond values ===
> waiting for server to start....1438600 2025-12-23 09:09:48.337 JST LOG:
> redirecting log output to logging collector process
> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will appear
> in directory "log".
>  done
> server started
> waiting for server to start....1438617 2025-12-23 09:09:48.443 JST LOG:
> redirecting log output to logging collector process
> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will appear
> in directory "log".
>  done
> server started
> waiting for server to start....1438634 2025-12-23 09:09:48.561 JST LOG:
> redirecting log output to logging collector process
> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will appear
> in directory "log".
>  done
> server started
> CREATE TABLE
> Waiting for sr_check to run...
> Command executed after 1 seconds
>  node_id | hostname  | port  | status | pg_status | lb_weight |  role   |
> pg_role | select_cnt | load_balance_node | replication_delay |
> replication_state | replication_sync_state | last_status_change
>
> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>  0       | localhost | 11002 | up     | up        | 0.333333  | primary |
> primary | 0          | true              | 0                 |
>      |                        | 2025-12-23 09:09:49
>  1       | localhost | 11003 | up     | up        | 0.333333  | standby |
> standby | 0          | false             | 0                 |
>      |                        | 2025-12-23 09:09:49
>  2       | localhost | 11004 | up     | up        | 0.333333  | standby |
> standby | 0          | false             | 0                 |
>      |                        | 2025-12-23 09:09:49
> (3 rows)
>
> fail: external command delay logging not found
>
> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <[email protected]>
> wrote:
> >
> >> Thank you for updating the patch! This time the patch applies without
> >> any issue and compiles fine. Unfortunately regression test failed.
> >>
> >> testing 041.external_replication_delay...failed.
> >>
> >> From the regression log, it seems Test7 failed.
> >>
> >>
> ------------------------------------------------------------------------------
> >> === Test7: Command timeout handling ===
> >> waiting for server to start....411181 2025-11-24 16:31:05.244 JST LOG:
> >> redirecting log output to logging collector process
> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will appear
> in
> >> directory "log".
> >>  done
> >> server started
> >> waiting for server to start....411196 2025-11-24 16:31:05.352 JST LOG:
> >> redirecting log output to logging collector process
> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will appear
> in
> >> directory "log".
> >>  done
> >> server started
> >> waiting for server to start....411213 2025-11-24 16:31:05.461 JST LOG:
> >> redirecting log output to logging collector process
> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will appear
> in
> >> directory "log".
> >>  done
> >> server started
> >> Waiting for command timeout...
> >> fail: command timeout not detected
> >>
> >>
> ------------------------------------------------------------------------------
> >>
> >> Attached is the pgpool.log. If you need more info, please let me know.
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
> >>
> >> > Hi Tatsuo,
> >> >
> >> > Sorry again, this was due to the separation of 2 patches and i only
> sent
> >> > the one.
> >> >
> >> > I've merged it into 1 commit and 1 patch and rebased over master to
> avoid
> >> > these issues moving forward.
> >> >
> >> > PFA latest version
> >> >
> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <[email protected]>
> >> wrote:
> >> >
> >> >> Hi Nadav,
> >> >>
> >> >> Thank you for new patch.
> >> >> Unfortunately the patch did not apply to current master.
> >> >>
> >> >> $ git apply
> >> >> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
> >> >> error: patch failed:
> src/streaming_replication/pool_worker_child.c:694
> >> >> error: src/streaming_replication/pool_worker_child.c: patch does not
> >> apply
> >> >>
> >> >> Maybe the patch is on top of your previous patch?
> >> >>
> >> >> Also I suggest to use "-v" option of "git format-patch" to add the
> >> >> patch version number so that we can easily know which patch is the
> >> >> latest.
> >> >>
> >> >> Best regards,
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS K.K.
> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> Japanese:http://www.sraoss.co.jp
> >> >>
> >> >> > Hi Tatsuo,
> >> >> >
> >> >> > Please see attached an updated version.
> >> >> >
> >> >> > thank you
> >> >> >
> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <[email protected]>
> >> >> wrote:
> >> >> >
> >> >> >> > Sorry for that - thanks for the patch.
> >> >> >> >
> >> >> >> > Please find attached a new version
> >> >> >>
> >> >> >> Thanks for the new version. Unfortunately this time regression
> test
> >> >> >> fails at:
> >> >> >>
> >> >> >> > Waiting for command timeout...
> >> >> >> > fail: command timeout not detected
> >> >> >>
> >> >> >> Attached is the pgpool.log.
> >> >> >>
> >> >> >> Best regards,
> >> >> >> --
> >> >> >> Tatsuo Ishii
> >> >> >> SRA OSS K.K.
> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> >> Japanese:http://www.sraoss.co.jp
> >> >> >>
> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
> [email protected]>
> >> >> >> wrote:
> >> >> >> >
> >> >> >> >> > thanks and sorry for the issues, please find attached updated
> >> >> version.
> >> >> >> >>
> >> >> >> >> No problem.
> >> >> >> >>
> >> >> >> >> This time the patch applies fine, no compiler warnings.
> However,
> >> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if this
> >> >> >> >> matters).  So I looked into
> >> >> >> >>
> src/test/regression/tests/041.external_replication_delay/test.sh a
> >> >> >> >> little bit and apply attached patch (test.sh.patch). It moved
> >> forward
> >> >> >> >> partially but failed at:
> >> >> >> >>
> >> >> >> >> fail: command execution failure not detected
> >> >> >> >>
> >> >> >> >> Please find attached
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
> >> >> >> >> and src/test/regression/log/041.external_replication_delay.
> >> >> >> >>
> >> >> >> >> Best regards,
> >> >> >> >> --
> >> >> >> >> Tatsuo Ishii
> >> >> >> >> SRA OSS K.K.
> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> >> >> Japanese:http://www.sraoss.co.jp
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > Nadav Shatz
> >> >> >> > Tailor Brands | CTO
> >> >> >>
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Nadav Shatz
> >> >> > Tailor Brands | CTO
> >> >>
> >> >
> >> >
> >> > --
> >> > Nadav Shatz
> >> > Tailor Brands | CTO
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>


-- 
Nadav Shatz
Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-23 08:46  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-12-23 08:46 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> Hi Tatsuo,
> 
> I'km running into issues testing this and have created a full docker
> compose setup - can you please point me to up to date guides on the best
> way to run the tests so i know we're doing it the same way?
> 
> Thank you for all your help!

I have run the regression test on the Pgpool-II master branch on my
Ubuntu 24 box.

cd pgpool2/src/test/regression
./regress.sh 041

This time I noticed:

- The patch does not named with version number
- The patch creates .dockerignore and docker/ directory.

Are they intended? I am asking because they are different from the
previous version.

> On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> > I think everything is passing now. new version attached.
>>
>> Unfortunately Test1 did not pass.
>>
>> === Test1: Basic external command with integer millisecond values ===
>> waiting for server to start....1438600 2025-12-23 09:09:48.337 JST LOG:
>> redirecting log output to logging collector process
>> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will appear
>> in directory "log".
>>  done
>> server started
>> waiting for server to start....1438617 2025-12-23 09:09:48.443 JST LOG:
>> redirecting log output to logging collector process
>> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will appear
>> in directory "log".
>>  done
>> server started
>> waiting for server to start....1438634 2025-12-23 09:09:48.561 JST LOG:
>> redirecting log output to logging collector process
>> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will appear
>> in directory "log".
>>  done
>> server started
>> CREATE TABLE
>> Waiting for sr_check to run...
>> Command executed after 1 seconds
>>  node_id | hostname  | port  | status | pg_status | lb_weight |  role   |
>> pg_role | select_cnt | load_balance_node | replication_delay |
>> replication_state | replication_sync_state | last_status_change
>>
>> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>>  0       | localhost | 11002 | up     | up        | 0.333333  | primary |
>> primary | 0          | true              | 0                 |
>>      |                        | 2025-12-23 09:09:49
>>  1       | localhost | 11003 | up     | up        | 0.333333  | standby |
>> standby | 0          | false             | 0                 |
>>      |                        | 2025-12-23 09:09:49
>>  2       | localhost | 11004 | up     | up        | 0.333333  | standby |
>> standby | 0          | false             | 0                 |
>>      |                        | 2025-12-23 09:09:49
>> (3 rows)
>>
>> fail: external command delay logging not found
>>
>> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> Thank you for updating the patch! This time the patch applies without
>> >> any issue and compiles fine. Unfortunately regression test failed.
>> >>
>> >> testing 041.external_replication_delay...failed.
>> >>
>> >> From the regression log, it seems Test7 failed.
>> >>
>> >>
>> ------------------------------------------------------------------------------
>> >> === Test7: Command timeout handling ===
>> >> waiting for server to start....411181 2025-11-24 16:31:05.244 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will appear
>> in
>> >> directory "log".
>> >>  done
>> >> server started
>> >> waiting for server to start....411196 2025-11-24 16:31:05.352 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will appear
>> in
>> >> directory "log".
>> >>  done
>> >> server started
>> >> waiting for server to start....411213 2025-11-24 16:31:05.461 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will appear
>> in
>> >> directory "log".
>> >>  done
>> >> server started
>> >> Waiting for command timeout...
>> >> fail: command timeout not detected
>> >>
>> >>
>> ------------------------------------------------------------------------------
>> >>
>> >> Attached is the pgpool.log. If you need more info, please let me know.
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >>
>> >> > Hi Tatsuo,
>> >> >
>> >> > Sorry again, this was due to the separation of 2 patches and i only
>> sent
>> >> > the one.
>> >> >
>> >> > I've merged it into 1 commit and 1 patch and rebased over master to
>> avoid
>> >> > these issues moving forward.
>> >> >
>> >> > PFA latest version
>> >> >
>> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <[email protected]>
>> >> wrote:
>> >> >
>> >> >> Hi Nadav,
>> >> >>
>> >> >> Thank you for new patch.
>> >> >> Unfortunately the patch did not apply to current master.
>> >> >>
>> >> >> $ git apply
>> >> >> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
>> >> >> error: patch failed:
>> src/streaming_replication/pool_worker_child.c:694
>> >> >> error: src/streaming_replication/pool_worker_child.c: patch does not
>> >> apply
>> >> >>
>> >> >> Maybe the patch is on top of your previous patch?
>> >> >>
>> >> >> Also I suggest to use "-v" option of "git format-patch" to add the
>> >> >> patch version number so that we can easily know which patch is the
>> >> >> latest.
>> >> >>
>> >> >> Best regards,
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS K.K.
>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> Japanese:http://www.sraoss.co.jp
>> >> >>
>> >> >> > Hi Tatsuo,
>> >> >> >
>> >> >> > Please see attached an updated version.
>> >> >> >
>> >> >> > thank you
>> >> >> >
>> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <[email protected]>
>> >> >> wrote:
>> >> >> >
>> >> >> >> > Sorry for that - thanks for the patch.
>> >> >> >> >
>> >> >> >> > Please find attached a new version
>> >> >> >>
>> >> >> >> Thanks for the new version. Unfortunately this time regression
>> test
>> >> >> >> fails at:
>> >> >> >>
>> >> >> >> > Waiting for command timeout...
>> >> >> >> > fail: command timeout not detected
>> >> >> >>
>> >> >> >> Attached is the pgpool.log.
>> >> >> >>
>> >> >> >> Best regards,
>> >> >> >> --
>> >> >> >> Tatsuo Ishii
>> >> >> >> SRA OSS K.K.
>> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >>
>> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
>> [email protected]>
>> >> >> >> wrote:
>> >> >> >> >
>> >> >> >> >> > thanks and sorry for the issues, please find attached updated
>> >> >> version.
>> >> >> >> >>
>> >> >> >> >> No problem.
>> >> >> >> >>
>> >> >> >> >> This time the patch applies fine, no compiler warnings.
>> However,
>> >> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if this
>> >> >> >> >> matters).  So I looked into
>> >> >> >> >>
>> src/test/regression/tests/041.external_replication_delay/test.sh a
>> >> >> >> >> little bit and apply attached patch (test.sh.patch). It moved
>> >> forward
>> >> >> >> >> partially but failed at:
>> >> >> >> >>
>> >> >> >> >> fail: command execution failure not detected
>> >> >> >> >>
>> >> >> >> >> Please find attached
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>> >> >> >> >> and src/test/regression/log/041.external_replication_delay.
>> >> >> >> >>
>> >> >> >> >> Best regards,
>> >> >> >> >> --
>> >> >> >> >> Tatsuo Ishii
>> >> >> >> >> SRA OSS K.K.
>> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Nadav Shatz
>> >> >> >> > Tailor Brands | CTO
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Nadav Shatz
>> >> >> > Tailor Brands | CTO
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Nadav Shatz
>> >> > Tailor Brands | CTO
>> >>
>> >
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-23 14:03  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 3 replies; 61+ messages in thread

From: Nadav Shatz @ 2025-12-23 14:03 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Hi Tatsuo,

Thank you for the note.

I've removed the docker stuff. started working in an ubuntu 24 VM to match
the setup. hopefully the results will be better, had so many issues
compiling and testing before that stuff wasn't properly formulated.

Attaching the latest patch.

this is what i'm seeing:
adav@lima-dev:/src/pgpool2/src/test/regression$ PGHOST=/tmp ./regress.sh -p
/usr/bin 041.external_replication_delay
creating pgpool-II temporary installation ...
moving pgpool_setup to temporary installation path ...
moving watchdog_setup to temporary installation path ...
using pgpool-II at /src/pgpool2/src/test/regression/temp/installed
*************************
REGRESSION MODE          : install
Pgpool-II version        : pgpool-II version 4.8devel (mitsukakeboshi)
Pgpool-II install path   : /src/pgpool2/src/test/regression/temp/installed
PostgreSQL bin           : /usr/lib/postgresql/16/bin
PostgreSQL Major version : 16
pgbench                  : /usr/lib/postgresql/16/bin/pgbench
PostgreSQL jdbc          :
/usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
*************************
testing 041.external_replication_delay...ok.
out of 1 ok:1 failed:0 timeout:0



On Tue, Dec 23, 2025 at 10:46 AM Tatsuo Ishii <[email protected]> wrote:

> > Hi Tatsuo,
> >
> > I'km running into issues testing this and have created a full docker
> > compose setup - can you please point me to up to date guides on the best
> > way to run the tests so i know we're doing it the same way?
> >
> > Thank you for all your help!
>
> I have run the regression test on the Pgpool-II master branch on my
> Ubuntu 24 box.
>
> cd pgpool2/src/test/regression
> ./regress.sh 041
>
> This time I noticed:
>
> - The patch does not named with version number
> - The patch creates .dockerignore and docker/ directory.
>
> Are they intended? I am asking because they are different from the
> previous version.
>
> > On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <[email protected]>
> wrote:
> >
> >> > I think everything is passing now. new version attached.
> >>
> >> Unfortunately Test1 did not pass.
> >>
> >> === Test1: Basic external command with integer millisecond values ===
> >> waiting for server to start....1438600 2025-12-23 09:09:48.337 JST LOG:
> >> redirecting log output to logging collector process
> >> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will appear
> >> in directory "log".
> >>  done
> >> server started
> >> waiting for server to start....1438617 2025-12-23 09:09:48.443 JST LOG:
> >> redirecting log output to logging collector process
> >> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will appear
> >> in directory "log".
> >>  done
> >> server started
> >> waiting for server to start....1438634 2025-12-23 09:09:48.561 JST LOG:
> >> redirecting log output to logging collector process
> >> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will appear
> >> in directory "log".
> >>  done
> >> server started
> >> CREATE TABLE
> >> Waiting for sr_check to run...
> >> Command executed after 1 seconds
> >>  node_id | hostname  | port  | status | pg_status | lb_weight |  role
>  |
> >> pg_role | select_cnt | load_balance_node | replication_delay |
> >> replication_state | replication_sync_state | last_status_change
> >>
> >>
> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
> >>  0       | localhost | 11002 | up     | up        | 0.333333  | primary
> |
> >> primary | 0          | true              | 0                 |
> >>      |                        | 2025-12-23 09:09:49
> >>  1       | localhost | 11003 | up     | up        | 0.333333  | standby
> |
> >> standby | 0          | false             | 0                 |
> >>      |                        | 2025-12-23 09:09:49
> >>  2       | localhost | 11004 | up     | up        | 0.333333  | standby
> |
> >> standby | 0          | false             | 0                 |
> >>      |                        | 2025-12-23 09:09:49
> >> (3 rows)
> >>
> >> fail: external command delay logging not found
> >>
> >> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <[email protected]>
> >> wrote:
> >> >
> >> >> Thank you for updating the patch! This time the patch applies without
> >> >> any issue and compiles fine. Unfortunately regression test failed.
> >> >>
> >> >> testing 041.external_replication_delay...failed.
> >> >>
> >> >> From the regression log, it seems Test7 failed.
> >> >>
> >> >>
> >>
> ------------------------------------------------------------------------------
> >> >> === Test7: Command timeout handling ===
> >> >> waiting for server to start....411181 2025-11-24 16:31:05.244 JST
> LOG:
> >> >> redirecting log output to logging collector process
> >> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will
> appear
> >> in
> >> >> directory "log".
> >> >>  done
> >> >> server started
> >> >> waiting for server to start....411196 2025-11-24 16:31:05.352 JST
> LOG:
> >> >> redirecting log output to logging collector process
> >> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will
> appear
> >> in
> >> >> directory "log".
> >> >>  done
> >> >> server started
> >> >> waiting for server to start....411213 2025-11-24 16:31:05.461 JST
> LOG:
> >> >> redirecting log output to logging collector process
> >> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will
> appear
> >> in
> >> >> directory "log".
> >> >>  done
> >> >> server started
> >> >> Waiting for command timeout...
> >> >> fail: command timeout not detected
> >> >>
> >> >>
> >>
> ------------------------------------------------------------------------------
> >> >>
> >> >> Attached is the pgpool.log. If you need more info, please let me
> know.
> >> >>
> >> >> Best regards,
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS K.K.
> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> Japanese:http://www.sraoss.co.jp
> >> >>
> >> >>
> >> >> > Hi Tatsuo,
> >> >> >
> >> >> > Sorry again, this was due to the separation of 2 patches and i only
> >> sent
> >> >> > the one.
> >> >> >
> >> >> > I've merged it into 1 commit and 1 patch and rebased over master to
> >> avoid
> >> >> > these issues moving forward.
> >> >> >
> >> >> > PFA latest version
> >> >> >
> >> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <[email protected]
> >
> >> >> wrote:
> >> >> >
> >> >> >> Hi Nadav,
> >> >> >>
> >> >> >> Thank you for new patch.
> >> >> >> Unfortunately the patch did not apply to current master.
> >> >> >>
> >> >> >> $ git apply
> >> >> >> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
> >> >> >> error: patch failed:
> >> src/streaming_replication/pool_worker_child.c:694
> >> >> >> error: src/streaming_replication/pool_worker_child.c: patch does
> not
> >> >> apply
> >> >> >>
> >> >> >> Maybe the patch is on top of your previous patch?
> >> >> >>
> >> >> >> Also I suggest to use "-v" option of "git format-patch" to add the
> >> >> >> patch version number so that we can easily know which patch is the
> >> >> >> latest.
> >> >> >>
> >> >> >> Best regards,
> >> >> >> --
> >> >> >> Tatsuo Ishii
> >> >> >> SRA OSS K.K.
> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> >> Japanese:http://www.sraoss.co.jp
> >> >> >>
> >> >> >> > Hi Tatsuo,
> >> >> >> >
> >> >> >> > Please see attached an updated version.
> >> >> >> >
> >> >> >> > thank you
> >> >> >> >
> >> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <
> [email protected]>
> >> >> >> wrote:
> >> >> >> >
> >> >> >> >> > Sorry for that - thanks for the patch.
> >> >> >> >> >
> >> >> >> >> > Please find attached a new version
> >> >> >> >>
> >> >> >> >> Thanks for the new version. Unfortunately this time regression
> >> test
> >> >> >> >> fails at:
> >> >> >> >>
> >> >> >> >> > Waiting for command timeout...
> >> >> >> >> > fail: command timeout not detected
> >> >> >> >>
> >> >> >> >> Attached is the pgpool.log.
> >> >> >> >>
> >> >> >> >> Best regards,
> >> >> >> >> --
> >> >> >> >> Tatsuo Ishii
> >> >> >> >> SRA OSS K.K.
> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> >> >> Japanese:http://www.sraoss.co.jp
> >> >> >> >>
> >> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
> >> [email protected]>
> >> >> >> >> wrote:
> >> >> >> >> >
> >> >> >> >> >> > thanks and sorry for the issues, please find attached
> updated
> >> >> >> version.
> >> >> >> >> >>
> >> >> >> >> >> No problem.
> >> >> >> >> >>
> >> >> >> >> >> This time the patch applies fine, no compiler warnings.
> >> However,
> >> >> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if
> this
> >> >> >> >> >> matters).  So I looked into
> >> >> >> >> >>
> >> src/test/regression/tests/041.external_replication_delay/test.sh a
> >> >> >> >> >> little bit and apply attached patch (test.sh.patch). It
> moved
> >> >> forward
> >> >> >> >> >> partially but failed at:
> >> >> >> >> >>
> >> >> >> >> >> fail: command execution failure not detected
> >> >> >> >> >>
> >> >> >> >> >> Please find attached
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
> >> >> >> >> >> and src/test/regression/log/041.external_replication_delay.
> >> >> >> >> >>
> >> >> >> >> >> Best regards,
> >> >> >> >> >> --
> >> >> >> >> >> Tatsuo Ishii
> >> >> >> >> >> SRA OSS K.K.
> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > --
> >> >> >> >> > Nadav Shatz
> >> >> >> >> > Tailor Brands | CTO
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > Nadav Shatz
> >> >> >> > Tailor Brands | CTO
> >> >> >>
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Nadav Shatz
> >> >> > Tailor Brands | CTO
> >> >>
> >> >
> >> >
> >> > --
> >> > Nadav Shatz
> >> > Tailor Brands | CTO
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>


-- 
Nadav Shatz
Tailor Brands | CTO


Attachments:

  [application/octet-stream] latest.patch (51.4K, 3-latest.patch)
  download | inline diff:
From aaf150f195ac453405abbfbde3efe0f5fde64a38 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 23 Dec 2025 13:39:04 +0200
Subject: [PATCH]    feat: external replication delay injection via external
 command

    Add support for obtaining replication delay from an external command
    instead of querying pg_stat_replication directly. This allows for
    more flexible monitoring setups where replication delay information
    may come from external monitoring systems.

    New configuration parameters:
    - replication_delay_source_cmd: Path to external command that provides
      delay values. When set, pgpool calls this command instead of querying
      PostgreSQL directly.
    - replication_delay_source_timeout: Timeout in seconds for the external
      command (default: 10).

    The external command receives replica identifiers as arguments in
    "host:port" format and should output delay values in milliseconds,
    one per line, corresponding to each replica argument.

    Includes regression test (041.external_replication_delay) covering:
    - Argument format validation
    - Integer and floating-point delay parsing
    - Error handling for malformed output and timeouts

diff --git a/doc/src/sgml/stream-check.sgml b/doc/src/sgml/stream-check.sgml
index d2ca3ca49c62dd481fb8e18616b12ab521521b1f..fc479908072f6afc63923ac699be0f63e15bc90a 100644
--- a/doc/src/sgml/stream-check.sgml
+++ b/doc/src/sgml/stream-check.sgml
@@ -309,6 +309,74 @@ GRANT pg_monitor TO sr_check_user;
     </listitem>
   </varlistentry>
 
+  <varlistentry id="guc-replication-delay-source-cmd" xreflabel="replication_delay_source_cmd">
+   <term><varname>replication_delay_source_cmd</varname> (<type>string</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_cmd</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies an external command to retrieve replication delay information for replica nodes.
+     When this parameter is set and not empty, <productname>Pgpool-II</productname> uses the
+     external command instead of built-in database queries to obtain replication delays.
+     The command is executed as the <productname>Pgpool-II</productname> process user.
+    </para>
+    <para>
+     The command receives replica node identifiers as positional arguments, with the primary
+     node omitted. Each identifier is in the format <literal>&lt;hostname&gt;:&lt;port&gt;</literal>,
+     for example <literal>server1:5432 server2:5432</literal>. The order matches
+     <productname>Pgpool-II</productname>'s backend order (excluding the primary), allowing the
+     script to correlate external metrics (such as from AWS CloudWatch for Aurora) to the correct nodes.
+    </para>
+    <para>
+     The command must write a single line to stdout containing one whitespace-separated delay value
+     per replica, in milliseconds, in the same order as the arguments. The primary node's delay is
+     implicitly zero and should not be included in the output. Delay values can be integers or
+     floating-point numbers.
+    </para>
+    <para>
+     Special value: <literal>-1</literal> indicates a replica that is down but not yet detected
+     by <productname>Pgpool-II</productname>'s health checks. <productname>Pgpool-II</productname>
+     will log this condition but rely on its own health-check logic to decide whether to trigger
+     failover; no failover is triggered solely by receiving <literal>-1</literal>.
+    </para>
+    <para>
+     Example for a 3-node cluster (1 primary + 2 replicas): if the command receives arguments
+     <literal>server1:5432 server2:5432</literal>, it should output <literal>"25.5 100"</literal>
+     to indicate the first replica has 25.5ms delay and the second has 100ms delay.
+    </para>
+    <para>
+     Default is empty (use built-in replication delay queries).
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry id="guc-replication-delay-source-timeout" xreflabel="replication_delay_source_timeout">
+   <term><varname>replication_delay_source_timeout</varname> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_timeout</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies the timeout in seconds for the external command specified by
+     <xref linkend="guc-replication-delay-source-cmd">.
+     If the command does not finish within the timeout, <productname>Pgpool-II</productname>
+     logs an error and continues using the built-in method.
+    </para>
+    <para>
+     Default is 10 seconds. Valid range is 1-3600 seconds.
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
   <varlistentry id="guc-log-standby-delay" xreflabel="log_standby_delay">
    <term><varname>log_standby_delay</varname> (<type>enum</type>)
     <indexterm>
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 0a0e483149190e14ca13406c08d0ee2ac0a9c53a..7c6d1803117541aaba50d9f9ff62e41e145c5d95 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -980,6 +980,16 @@ static struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_cmd", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"External command to retrieve replication delay information.",
+			CONFIG_VAR_TYPE_STRING, false, 0
+		},
+		&g_pool_config.replication_delay_source_cmd,
+		"",
+		NULL, NULL, NULL, NULL
+	},
+
 	{
 		{"failback_command", CFGCXT_RELOAD, FAILOVER_CONFIG,
 			"Command to execute when backend node is attached.",
@@ -2334,6 +2344,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_timeout", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Timeout for external replication delay command execution in seconds.",
+			CONFIG_VAR_TYPE_INT, false, 0
+		},
+		&g_pool_config.replication_delay_source_timeout,
+		10,
+		1, 3600,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	EMPTY_CONFIG_INT
 };
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 758d515525c93c1c3f2686da049b294a286a574a..6f5f88fb200c2ea82cccddd8f02c35c8d0ade8f4 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -86,7 +86,6 @@ typedef enum LogStandbyDelayModes
 	LSD_NONE
 } LogStandbyDelayModes;
 
-
 typedef enum MemCacheMethod
 {
 	SHMEM_CACHE = 1,
@@ -364,6 +363,8 @@ typedef struct
 	char	   *sr_check_password;	/* password for sr_check_user */
 	char	   *sr_check_database;	/* PostgreSQL database name for streaming
 									 * replication check */
+	char	   *replication_delay_source_cmd;	/* external command for replication delay */
+	int			replication_delay_source_timeout;	/* timeout for external command in seconds */
 	char	   *failover_command;	/* execute command when failover happens */
 	char	   *follow_primary_command; /* execute command when failover is
 										 * ended */
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 797906491cb996d24c59b3710f462f1405737248..454fdb9e5d1fd65437b6a67f12ab62658ea08f49 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -519,6 +519,20 @@ backend_clustering_mode = streaming_replication
 
 #sr_check_database = 'postgres'
                                    # Database name for streaming replication check
+
+#replication_delay_source_cmd = ''
+                                   # External command to retrieve replication delay information
+                                   # If set, pgpool uses this command instead of built-in queries
+                                   # Command receives replica node identifiers (host:port) as arguments
+                                   # Primary node is omitted from arguments
+                                   # Command should output one delay value (in ms) per replica
+                                   # Use -1 to indicate a replica that is down but not yet detected
+                                   # Format: "25 100" for 2 replicas (e.g., 3-node cluster with 1 primary)
+                                   # Command runs as the pgpool process user
+#replication_delay_source_timeout = 10
+                                   # Timeout for external command execution in seconds
+                                   # Range: 1-3600 seconds (default: 10)
+
 #delay_threshold = 0
                                    # Threshold before not dispatching query to standby node
                                    # Unit is in bytes
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 5bf19c37d0cf1033c624f34ab3737f18871bc2f5..457d0fab0912d44b19de70c05be8f2046dd987c5 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -43,6 +43,7 @@
 #include <unistd.h>
 #include <stdlib.h>
 #include <sys/time.h>
+#include <sys/wait.h>
 
 #ifdef HAVE_CRYPT_H
 #include <crypt.h>
@@ -76,6 +77,8 @@ static volatile sig_atomic_t restart_request = 0;
 static void establish_persistent_connection(void);
 static void discard_persistent_connection(void);
 static void check_replication_time_lag(void);
+static void check_replication_time_lag_with_cmd(void);
+static char *build_instance_identifier_for_node(int node_id);
 static void CheckReplicationTimeLagErrorCb(void *arg);
 static unsigned long long int text_to_lsn(char *text);
 static RETSIGTYPE my_signal_handler(int sig);
@@ -259,11 +262,16 @@ do_worker_child(void *params)
 					POOL_NODE_STATUS *node_status;
 					int			i;
 
-					/* Do replication time lag checking */
-					check_replication_time_lag();
+			/* Do replication time lag checking */
+			/* Use external command if replication_delay_source_cmd is configured */
+			if (pool_config->replication_delay_source_cmd &&
+				strlen(pool_config->replication_delay_source_cmd) > 0)
+				check_replication_time_lag_with_cmd();
+			else
+				check_replication_time_lag();
 
-					/* Check node status */
-					node_status = verify_backend_node_status(slots);
+			/* Check node status */
+			node_status = verify_backend_node_status(slots);
 
 
 					for (i = 0; i < NUM_BACKENDS; i++)
@@ -659,6 +667,420 @@ check_replication_time_lag(void)
 	error_context_stack = callback.previous;
 }
 
+#define MAX_CMD_OUTPUT 4096
+#define MAX_REASONABLE_DELAY_MS 3600000.0  /* 1 hour in milliseconds */
+
+/*
+ * Check replication time lag using external command
+ *
+ * The external command receives only replica (standby) node identifiers as arguments,
+ * omitting the primary node. It returns delay values in milliseconds for each replica.
+ * A value of -1 indicates a node that is down but not yet detected by pgpool's health checks.
+ */
+static void
+check_replication_time_lag_with_cmd(void)
+{
+	char		   *command = NULL;
+	char		   *line;
+	char		   *token;
+	char		   *saveptr;
+	double			delay_ms;
+	uint64			delay;
+	int				token_count = 0;
+	BackendInfo	   *bkinfo;
+	ErrorContextCallback callback;
+	int				pipefd[2] = {-1, -1};
+	pid_t			pid = -1;
+	int				ret;
+	struct timeval	timeout;
+	fd_set			readfds;
+	ssize_t			bytes_read;
+	int				status;
+	int				num_replicas;
+
+	if (NUM_BACKENDS <= 1)
+	{
+		/* If there's only one node, there's no point to do checking */
+		return;
+	}
+
+	if (REAL_PRIMARY_NODE_ID < 0)
+	{
+		/* No need to check if there's no primary */
+		return;
+	}
+
+	if (!VALID_BACKEND(REAL_PRIMARY_NODE_ID))
+	{
+		/* No need to check replication delay if primary is down */
+		return;
+	}
+
+	/* Capture primary node ID to avoid race conditions during execution */
+	int primary_node_id = REAL_PRIMARY_NODE_ID;
+
+	if (!pool_config->replication_delay_source_cmd ||
+		strlen(pool_config->replication_delay_source_cmd) == 0)
+	{
+		ereport(WARNING,
+				(errmsg("replication_delay_source_cmd is not configured"),
+				 errhint("Set replication_delay_source_cmd to use external command mode")));
+		/* Fall back to builtin method */
+		check_replication_time_lag();
+		return;
+	}
+
+	/* Allocate buffer for command output */
+	line = palloc(MAX_CMD_OUTPUT);
+	memset(line, 0, MAX_CMD_OUTPUT);
+
+	/*
+	 * Register a error context callback to throw proper context message
+	 */
+	callback.callback = CheckReplicationTimeLagErrorCb;
+	callback.arg = NULL;
+	callback.previous = error_context_stack;
+	error_context_stack = &callback;
+
+	/* Execute command as current process user */
+	PG_TRY();
+	{
+		const char *base_command = pool_config->replication_delay_source_cmd;
+		size_t total_len = strlen(base_command) + 1; /* +1 for NUL */
+
+		/* Build command with replica-only arguments (omit primary) */
+		/* Calculate total command length including space-separated replica identifiers */
+		for (int i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == primary_node_id)
+				continue; /* Skip primary node */
+
+			char *ident = build_instance_identifier_for_node(i);
+			total_len += 1 /* space */ + strlen(ident);
+			pfree(ident);
+		}
+
+		command = palloc(total_len);
+		strlcpy(command, base_command, total_len);
+
+		/* Append replica identifiers */
+		size_t current_len = strlen(command);
+		for (int i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == primary_node_id)
+				continue; /* Skip primary node */
+
+			char *ident = build_instance_identifier_for_node(i);
+
+			/* Append space and identifier */
+			snprintf(command + current_len, total_len - current_len, " %s", ident);
+			current_len += strlen(command + current_len);
+
+			pfree(ident);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("executing replication delay command: %s", command)));
+
+		if (pipe(pipefd) == -1)
+		{
+			ereport(ERROR,
+					(errmsg("pipe failed: %m")));
+		}
+
+		pid = fork();
+		if (pid == -1)
+		{
+			close(pipefd[0]);
+			close(pipefd[1]);
+			ereport(ERROR,
+					(errmsg("fork failed: %m")));
+		}
+
+		if (pid == 0)
+		{
+			/* Child process */
+			close(pipefd[0]); /* Close read end */
+			if (dup2(pipefd[1], STDOUT_FILENO) == -1)
+			{
+				fprintf(stderr, "dup2 failed: %s\n", strerror(errno));
+				exit(1);
+			}
+			close(pipefd[1]); /* Close write end (duplicated to stdout) */
+
+			/* Execute command using shell */
+			execl("/bin/sh", "sh", "-c", command, (char *)NULL);
+
+			/* If execl fails */
+			fprintf(stderr, "execl failed: %s\n", strerror(errno));
+			_exit(127);
+		}
+
+		/* Parent process */
+		close(pipefd[1]); /* Close write end */
+		pipefd[1] = -1;
+
+		/* Set up timeout for select */
+		timeout.tv_sec = pool_config->replication_delay_source_timeout;
+		timeout.tv_usec = 0;
+
+		FD_ZERO(&readfds);
+		FD_SET(pipefd[0], &readfds);
+
+		/* Wait for output or timeout */
+		ret = select(pipefd[0] + 1, &readfds, NULL, NULL, &timeout);
+
+		if (ret == -1)
+		{
+			int save_errno = errno;
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+			pid = -1;
+			close(pipefd[0]);
+			pipefd[0] = -1;
+			if (save_errno == EINTR)
+			{
+				/* Interrupted */
+				ereport(ERROR,
+						(errmsg("select interrupted during replication delay command execution")));
+			}
+			else
+			{
+				ereport(ERROR,
+						(errmsg("select failed: %m")));
+			}
+		}
+		else if (ret == 0)
+		{
+			/* Timeout */
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+			pid = -1;
+			close(pipefd[0]);
+			pipefd[0] = -1;
+			ereport(ERROR,
+					(errmsg("replication delay command timed out after %d seconds: %s",
+							pool_config->replication_delay_source_timeout, command),
+					 errhint("Consider increasing replication_delay_source_timeout or optimizing the command")));
+		}
+
+		/* Data is available */
+		bytes_read = read(pipefd[0], line, MAX_CMD_OUTPUT - 1);
+		close(pipefd[0]);
+		pipefd[0] = -1;
+
+		/* Wait for child to finish */
+		waitpid(pid, &status, 0);
+		pid = -1;
+
+		if (bytes_read < 0)
+		{
+			ereport(ERROR,
+					(errmsg("failed to read output from replication delay command: %s", command),
+					 errdetail("read failed: %m")));
+		}
+
+		/* Check exit status */
+		if (WIFEXITED(status) && WEXITSTATUS(status) != 0)
+		{
+			ereport(ERROR,
+					(errmsg("replication delay command failed with exit code %d: %s",
+							WEXITSTATUS(status), command)));
+		}
+		else if (WIFSIGNALED(status))
+		{
+			ereport(ERROR,
+					(errmsg("replication delay command terminated by signal %d: %s",
+							WTERMSIG(status), command)));
+		}
+
+		/* Check if output was truncated */
+		if (bytes_read == MAX_CMD_OUTPUT - 1 && line[MAX_CMD_OUTPUT - 2] != '\n')
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command output may have been truncated")));
+		}
+
+		/* Null-terminate the string */
+		line[bytes_read] = '\0';
+
+		pfree(command);
+		command = NULL;
+
+		/* Set primary node delay to 0 */
+		bkinfo = pool_get_node_info(primary_node_id);
+		bkinfo->standby_delay = 0;
+		bkinfo->standby_delay_by_time = true;
+
+		/* Count expected replicas */
+		num_replicas = NUM_BACKENDS - 1; /* Total nodes minus primary */
+
+		/* Count tokens in output for validation */
+		char *line_copy = pstrdup(line);
+		char *temp_token = strtok(line_copy, " \t\n");
+		while (temp_token != NULL)
+		{
+			token_count++;
+			temp_token = strtok(NULL, " \t\n");
+		}
+		pfree(line_copy);
+
+		/* Validate output format */
+		if (token_count == 0)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command produced no output"),
+					 errhint("Command should output delay values separated by spaces, one per replica node")));
+		}
+		else if (token_count < num_replicas)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
+							token_count, num_replicas),
+					 errhint("Command should output one delay value per replica node. Missing values will be treated as 0.")));
+		}
+		else if (token_count > num_replicas)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
+							token_count, num_replicas),
+					 errhint("Command should output exactly one delay value per replica node. Extra values will be ignored.")));
+		}
+
+		/* Parse the output - one delay value per replica in order */
+		token = strtok_r(line, " \t\n", &saveptr);
+
+		for (int i = 0; i < NUM_BACKENDS && token != NULL; i++)
+		{
+			if (i == primary_node_id)
+				continue; /* Skip primary - it's not in the output */
+
+			if (!VALID_BACKEND(i))
+			{
+				/* Skip invalid backend but consume token */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				continue;
+			}
+
+			char *endptr;
+			delay_ms = strtod(token, &endptr);
+
+			/* Validate the conversion */
+			if (*endptr != '\0')
+			{
+				ereport(WARNING,
+						(errmsg("invalid delay value '%s' for node %d, treating as 0",
+								token, i)));
+				delay_ms = 0;
+			}
+
+			bkinfo = pool_get_node_info(i);
+
+			/* Handle -1 for down nodes */
+			if (delay_ms == -1.0)
+			{
+				ereport(LOG,
+						(errmsg("node %d reported as down by external command (delay -1), relying on health check for failover decision",
+								i)));
+				/* Keep previous delay value, don't trigger failover */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				continue;
+			}
+
+			/* Validate delay value range */
+			if (delay_ms < 0)
+			{
+				ereport(WARNING,
+						(errmsg("negative delay value %.3f for node %d (other than -1), treating as 0",
+								delay_ms, i)));
+				delay_ms = 0;
+			}
+			else if (delay_ms > MAX_REASONABLE_DELAY_MS)
+			{
+				ereport(WARNING,
+						(errmsg("extremely large delay value %.3f for node %d",
+								delay_ms, i)));
+			}
+
+			/* Convert delay from milliseconds to microseconds for internal storage */
+			delay = (uint64)(delay_ms * 1000);
+			bkinfo->standby_delay = delay;
+			bkinfo->standby_delay_by_time = true;
+
+			/* Log delay if necessary */
+			uint64 delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in milliseconds, convert to microseconds */
+
+			if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
+				(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
+				 bkinfo->standby_delay > delay_threshold_by_time))
+			{
+				ereport(LOG,
+						(errmsg("Replication of node: %d is behind %.3f second(s) from the primary server (node: %d) [external command]",
+								i, delay_ms / 1000, primary_node_id)));
+			}
+
+			token = strtok_r(NULL, " \t\n", &saveptr);
+		}
+
+	}
+	PG_CATCH();
+	{
+		/* Cleanup in case of error */
+		if (pid > 0) {
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+		}
+		if (pipefd[0] != -1) close(pipefd[0]);
+		if (pipefd[1] != -1) close(pipefd[1]);
+
+		if (line)
+			pfree(line);
+		if (command)
+			pfree(command);
+		error_context_stack = callback.previous;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Normal cleanup */
+	if (line)
+		pfree(line);
+
+	error_context_stack = callback.previous;
+}
+
+/*
+ * build_instance_identifier_for_node
+ *  Build an identifier string for a backend node for passing to external commands.
+ *  Format: "<hostname>:<port>"
+ */
+static char *
+build_instance_identifier_for_node(int node_id)
+{
+	BackendInfo *bi = pool_get_node_info(node_id);
+	const char *hostname;
+
+	if (!bi || bi->backend_hostname[0] == '\0' || bi->backend_port <= 0)
+	{
+		/* Fallback if hostname or port is not set */
+		return psprintf("unknown_node_%d", node_id);
+	}
+
+	hostname = bi->backend_hostname;
+
+	/* Validate hostname for security - check for shell metacharacters */
+	if (strpbrk(hostname, "$`\\|;&<>()[]{}\"\'\n\r\t") != NULL)
+	{
+		ereport(LOG,
+				(errmsg("hostname for node %d contains potentially dangerous characters: %s",
+						node_id, hostname),
+				 errhint("Hostnames with shell metacharacters may pose security risks when used with external commands. Consider using IP addresses or sanitized hostnames.")));
+	}
+
+	/* Use hostname:port format */
+	return psprintf("%s:%d", hostname, bi->backend_port);
+}
+
 static void
 CheckReplicationTimeLagErrorCb(void *arg)
 {
diff --git a/src/test/regression/tests/041.external_replication_delay/README b/src/test/regression/tests/041.external_replication_delay/README
new file mode 100644
index 0000000000000000000000000000000000000000..b4df5da402b557190c8f6a2bc7822944cc5b04cc
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/README
@@ -0,0 +1,59 @@
+External Replication Delay Command Test
+========================================
+
+This test verifies the external command replication delay source feature.
+
+Test Coverage:
+- External command receives replica node identifiers only (primary omitted)
+- Instance identifiers in host:port format
+- Basic external command execution with integer and float millisecond values
+- Delay threshold functionality with external commands
+- Command execution as pgpool process user (no su wrapper)
+- Error handling for missing/invalid commands
+- Command execution failure scenarios
+- Command timeout handling with configurable timeout values
+- Input validation for invalid, negative (other than -1), and extremely large delay values
+- Handling of -1 for down nodes (logged but no immediate failover)
+- Wrong number of output values validation
+- Multiple -1 values (multiple down replicas)
+- Mixed scenarios (some replicas up, some down)
+- Output truncation detection
+
+Files:
+- test.sh: Main test script
+- test_parsing.sh: Unit test for parsing logic
+- test_validation.sh: Validation and edge case testing
+- README: This documentation
+
+Key Changes from Original Version:
+- Primary node is omitted from command arguments
+- Command receives only replica identifiers
+- Instance identifiers are in host:port format (not application_name)
+- Output format: one delay per replica (not per all nodes)
+- -1 value indicates down replica without triggering failover
+- Format example: "25 100" for 2 replicas (3-node cluster = 1 primary + 2 replicas)
+
+The test creates temporary command scripts that output delay values in the format:
+"replica1_delay replica2_delay ..."
+
+Where delays are in milliseconds and can be integer or floating-point values.
+Special value -1 indicates a replica that is down but not yet detected by pgpool.
+
+Test Environment:
+- Uses streaming replication mode with 3 nodes
+- Node 0 is primary (omitted from command arguments)
+- Nodes 1 and 2 are replicas (included in command arguments)
+- Configures sr_check_period = 1 second for faster testing
+- Tests various delay scenarios and threshold behaviors
+
+Expected Behavior:
+- External commands receive replica identifiers in host:port format
+- Primary node identifier is never passed to command
+- Command outputs one delay value per replica
+- -1 values are logged but don't trigger immediate failover
+- Delay values are parsed correctly (both int and float)
+- Threshold comparisons work properly
+- Error conditions are handled gracefully
+- Commands timeout appropriately based on configuration
+- Timeout errors provide helpful messages and hints
+- Tests are reliable with proper wait mechanisms instead of fixed sleeps
diff --git a/src/test/regression/tests/041.external_replication_delay/test.sh b/src/test/regression/tests/041.external_replication_delay/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..de704e55331247893f4b2e26fb67977875f1ba42
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test.sh
@@ -0,0 +1,409 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command replication delay source
+#
+source $TESTLIBS
+TESTDIR=testdir
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create external command scripts for testing
+# NOTE: Commands now only output delay values for REPLICAS (not primary)
+cat > delay_cmd_static.sh << 'EOF'
+#!/bin/bash
+# Static delay values for replicas: node1=25ms, node2=50ms (node0 is primary, not included)
+echo "25 50"
+EOF
+chmod +x delay_cmd_static.sh
+
+cat > delay_cmd_float.sh << 'EOF'
+#!/bin/bash
+# Float delay values for replicas: node1=25.5ms, node2=100.75ms
+echo "25.5 100.75"
+EOF
+chmod +x delay_cmd_float.sh
+
+cat > delay_cmd_high.sh << 'EOF'
+#!/bin/bash
+# High delay values to test threshold: node1=2000ms, node2=3000ms
+echo "2000 3000"
+EOF
+chmod +x delay_cmd_high.sh
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test0: External command receives replica identifiers only (primary omitted) ==="
+# ----------------------------------------------------------------------------------------
+# Command that captures its arguments and outputs valid delays for 2 replicas
+cat > delay_cmd_args.sh << 'EOF'
+#!/bin/bash
+printf "%s " "$@" > args.txt
+echo "25 50"
+EOF
+chmod +x delay_cmd_args.sh
+
+echo "replication_delay_source_cmd = './delay_cmd_args.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+# Reduce memory requirements for macOS shared memory limits
+echo "num_init_children = 4" >> etc/pgpool.conf
+echo "max_pool = 2" >> etc/pgpool.conf
+# Disable query caching to avoid shared memory issues on macOS
+echo "memory_cache_enabled = off" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+echo "Waiting for sr_check to pass args..."
+for i in {1..10}; do
+    if [ -f args.txt ]; then
+        break
+    fi
+    sleep 1
+done
+
+if [ ! -f args.txt ]; then
+    echo fail: did not capture command arguments
+    ./shutdownall
+    exit 1
+fi
+
+ARGS_CONTENT=$(cat args.txt | sed 's/[[:space:]]*$//')
+# Should receive 2 replica identifiers in host:port format (localhost:11003 localhost:11004 or server1:11003 server2:11004)
+# Primary (localhost:11002 or server0:11002) should be omitted
+if ! echo "$ARGS_CONTENT" | grep -qE "(server1|localhost):11003"; then
+    echo "fail: expected replica1:11003 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if ! echo "$ARGS_CONTENT" | grep -qE "(server2|localhost):11004"; then
+    echo "fail: expected replica2:11004 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if echo "$ARGS_CONTENT" | grep -qE "(server0|localhost):11002"; then
+    echo "fail: primary should not be in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: argument order correct - replicas only, primary omitted, host:port format
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test1: Basic external command with integer millisecond values ==="
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_static.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run and populate delay values
+# sr_check_period is 1 second, so wait a bit longer to ensure it runs
+echo "Waiting for sr_check to run..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that delay values are populated in the log
+grep "executing replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command was not executed
+    echo "Log contents:"
+    tail -20 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+# Verify actual delay values were parsed
+if ! $PSQL -t -c "SHOW POOL_NODES" test | grep -E "[0-9]+\.[0-9]+" >/dev/null; then
+    echo "Warning: No delay values found in POOL_NODES output"
+fi
+
+# Check for delay log messages
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command delay logging not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: basic external command test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test2: External command with floating-point millisecond values ==="
+# ----------------------------------------------------------------------------------------
+# Update configuration to use float command
+sed -i.bak "s|delay_cmd_static.sh|delay_cmd_float.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with float values
+echo "Waiting for sr_check with float values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log 2>/dev/null; then
+        echo "Float command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that float values are handled correctly
+grep "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: float command was not executed
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: floating-point values test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test3: External command with delay threshold ==="
+# ----------------------------------------------------------------------------------------
+# Update configuration to use high delay command and set threshold
+sed -i.bak "s|delay_cmd_float.sh|delay_cmd_high.sh|" etc/pgpool.conf
+echo "delay_threshold_by_time = 1000" >> etc/pgpool.conf
+echo "backend_weight0 = 0" >> etc/pgpool.conf  # Force queries to standby normally
+echo "backend_weight2 = 0" >> etc/pgpool.conf  # Only use node 1 as standby
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and detect high delays
+echo "Waiting for sr_check with high delay values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_high.sh" log/pgpool.log 2>/dev/null; then
+        echo "High delay command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SELECT * FROM t1 LIMIT 1;
+EOF
+
+# With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
+# Log format can vary: either "statement: SELECT..." or "SELECT... DB node id:"
+if ! grep -E "DB node id: 0.*statement: SELECT \* FROM t1 LIMIT 1" log/pgpool.log >/dev/null 2>&1 && \
+   ! grep -E "SELECT \* FROM t1 LIMIT 1.*DB node id: 0" log/pgpool.log >/dev/null 2>&1; then
+    echo fail: query was not sent to primary node despite high delay
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: delay threshold test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test4: External command execution as process user ==="
+# ----------------------------------------------------------------------------------------
+# Test that command runs as the current pgpool process user
+sed -i.bak "s|delay_cmd_high.sh|delay_cmd_static.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for sr_check to run as process user..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed as process user after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check that command was executed (without su wrapper)
+grep "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command was not executed as process user
+    ./shutdownall
+    exit 1
+fi
+
+# Verify no su command was used
+if grep -q "executing replication delay command.*su.*" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not use su wrapper
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: process user execution test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test5: Error handling - missing command ==="
+# ----------------------------------------------------------------------------------------
+# Test error handling when command is not configured
+sed -i.bak "s|replication_delay_source_cmd = './delay_cmd_static.sh'|replication_delay_source_cmd = ''|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# With empty command, should fall back to builtin method
+# No specific error message expected - just verify it doesn't crash
+sleep 3
+
+echo "ok: empty command test succeeded (fallback to builtin)"
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test6: Error handling - command execution failure ==="
+# ----------------------------------------------------------------------------------------
+# Test error handling when command fails
+echo "replication_delay_source_cmd = './nonexistent_command.sh'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with failing command
+echo "Waiting for sr_check with failing command..."
+for i in {1..5}; do
+    # Check for various error conditions: exit code failure, no output, or explicit failure message
+    if grep -qE "(replication delay command failed with exit code|replication delay command produced no output|failed to (execute|read output from) replication delay command)" log/pgpool.log 2>/dev/null; then
+        echo "Command failure detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about command execution failure
+# Accept multiple possible error messages depending on shell behavior:
+# - "failed with exit code" when command returns non-zero
+# - "produced no output" when command produces empty output
+# - "failed to execute/read" for other failures
+if ! grep -qE "(replication delay command failed with exit code|replication delay command produced no output|failed to (execute|read output from) replication delay command)" log/pgpool.log 2>/dev/null; then
+    echo fail: command execution failure not detected
+    echo "Log contents:"
+    tail -50 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command failure test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test7: Command timeout handling ==="
+# ----------------------------------------------------------------------------------------
+# Create a command that takes longer than the timeout
+cat > delay_cmd_slow.sh << 'EOF'
+#!/bin/bash
+# Slow command that takes 15 seconds (longer than default 10s timeout)
+sleep 15
+echo "25 50"
+EOF
+chmod +x delay_cmd_slow.sh
+
+# Set a short timeout and use the slow command
+sed -i.bak "s|replication_delay_source_cmd = './nonexistent_command.sh'|replication_delay_source_cmd = './delay_cmd_slow.sh'|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 3" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and timeout
+echo "Waiting for command timeout..."
+for i in {1..15}; do
+    if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+        echo "Command timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout error message
+grep "replication delay command timed out after 3 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test8: Handling of -1 for down nodes ==="
+# ----------------------------------------------------------------------------------------
+# Create a command that returns -1 for one replica
+cat > delay_cmd_with_down_node.sh << 'EOF'
+#!/bin/bash
+# Return -1 for first replica (indicating it's down), normal value for second
+echo "-1 50"
+EOF
+chmod +x delay_cmd_with_down_node.sh
+
+# Reset config
+rm -f etc/pgpool.conf.bak
+sed -i.bak "s|delay_cmd_slow.sh|delay_cmd_with_down_node.sh|" etc/pgpool.conf
+sed -i.bak "s|replication_delay_source_timeout = 3|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to process -1 value
+echo "Waiting for sr_check to process -1 value..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command.*delay -1" log/pgpool.log 2>/dev/null; then
+        echo "-1 handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for -1 logging message
+grep "node.*reported as down by external command.*delay -1.*relying on health check" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 handling message not found
+    ./shutdownall
+    exit 1
+fi
+
+# Verify that pgpool didn't trigger failover just from -1
+# Check for actual failover execution, not just config mentions of failover_command
+if grep -qE "(starting.*(failover|degeneration)|failover done|execute.*(failover|failback)_command)" log/pgpool.log 2>/dev/null; then
+    echo "fail: -1 should not trigger immediate failover"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: -1 handling test succeeded
+./shutdownall
+
+echo "All external replication delay tests passed!"
+exit 0
diff --git a/src/test/regression/tests/041.external_replication_delay/test_parsing.sh b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
new file mode 100755
index 0000000000000000000000000000000000000000..82fdad144cf5a94efbf79020a50ebc2ef00d6fb8
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+#-------------------------------------------------------------------
+# Unit test for external command parsing logic
+# This tests the parsing without needing a full pgpool setup
+#
+
+echo "=== Testing external command output parsing ==="
+
+# Test 1: Integer values
+echo "Test 1: Integer millisecond values"
+echo "0 25 50" > test_output.txt
+echo "Expected: 0ms, 25ms, 50ms"
+echo "Output: $(cat test_output.txt)"
+echo ""
+
+# Test 2: Float values
+echo "Test 2: Floating-point millisecond values"
+echo "0 25.5 100.75" > test_output_float.txt
+echo "Expected: 0ms, 25.5ms, 100.75ms"
+echo "Output: $(cat test_output_float.txt)"
+echo ""
+
+# Test 3: High precision float values
+echo "Test 3: High precision values"
+echo "0 0.001 999.999" > test_output_precision.txt
+echo "Expected: 0ms, 0.001ms, 999.999ms"
+echo "Output: $(cat test_output_precision.txt)"
+echo ""
+
+# Test 4: Edge case - zero values
+echo "Test 4: All zero values"
+echo "0 0 0" > test_output_zeros.txt
+echo "Expected: 0ms, 0ms, 0ms"
+echo "Output: $(cat test_output_zeros.txt)"
+echo ""
+
+# Test 5: Edge case - large values
+echo "Test 5: Large delay values"
+echo "0 5000 10000" > test_output_large.txt
+echo "Expected: 0ms, 5000ms, 10000ms"
+echo "Output: $(cat test_output_large.txt)"
+echo ""
+
+# Test 6: Mixed integer and float values
+echo "Test 6: Mixed integer and float values"
+echo "0 25 50.5" > test_output_mixed.txt
+echo "Expected: 0ms, 25ms, 50.5ms"
+echo "Output: $(cat test_output_mixed.txt)"
+echo ""
+
+# Cleanup
+rm -f test_output_*.txt
+
+echo "All parsing tests completed. These outputs should be parseable by the external command feature."
diff --git a/src/test/regression/tests/041.external_replication_delay/test_validation.sh b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
new file mode 100755
index 0000000000000000000000000000000000000000..2cd4a7f0b35e152b6d4b770931ed4821cdd9d201
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
@@ -0,0 +1,323 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command validation and edge cases
+#
+source $TESTLIBS
+TESTDIR=testdir_validation
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create test command scripts
+# NOTE: All commands output values for REPLICAS only (primary omitted)
+cat > delay_cmd_validation.sh << 'EOF'
+#!/bin/bash
+# Test validation: output with invalid values for 2 replicas
+echo "invalid_value 50.5"
+EOF
+chmod +x delay_cmd_validation.sh
+
+cat > delay_cmd_negative.sh << 'EOF'
+#!/bin/bash
+# Test negative values (other than -1)
+echo "-25 50"
+EOF
+chmod +x delay_cmd_negative.sh
+
+cat > delay_cmd_large.sh << 'EOF'
+#!/bin/bash
+# Test extremely large values
+echo "9999999 50"
+EOF
+chmod +x delay_cmd_large.sh
+
+cat > delay_cmd_wrong_count.sh << 'EOF'
+#!/bin/bash
+# Test wrong number of values (only 1 instead of 2 for 2 replicas)
+echo "25"
+EOF
+chmod +x delay_cmd_wrong_count.sh
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test1: Validation of invalid delay values ==="
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_validation.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+# Reduce memory requirements for macOS shared memory limits
+echo "num_init_children = 4" >> etc/pgpool.conf
+echo "max_pool = 2" >> etc/pgpool.conf
+# Disable query caching to avoid shared memory issues on macOS
+echo "memory_cache_enabled = off" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run
+echo "Waiting for validation test..."
+for i in {1..10}; do
+    if grep -q "invalid delay value" log/pgpool.log 2>/dev/null; then
+        echo "Validation error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for validation warning
+grep "invalid delay value 'invalid_value' for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: validation warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: invalid value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test2: Negative delay values (other than -1) ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_validation.sh|delay_cmd_negative.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for negative value test..."
+for i in {1..10}; do
+    if grep -q "negative delay value.*other than -1" log/pgpool.log 2>/dev/null; then
+        echo "Negative value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for negative value warning
+grep "negative delay value.*other than -1.*treating as 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: negative value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: negative value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test3: Extremely large delay values ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_negative.sh|delay_cmd_large.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for large value test..."
+for i in {1..10}; do
+    if grep -q "extremely large delay value" log/pgpool.log 2>/dev/null; then
+        echo "Large value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for large value warning
+grep "extremely large delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: large value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: large value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test4: Wrong number of output values ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_large.sh|delay_cmd_wrong_count.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for wrong count test..."
+for i in {1..10}; do
+    if grep -q "returned.*values, expected.*replica" log/pgpool.log 2>/dev/null; then
+        echo "Wrong count warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for wrong count warning
+grep "returned.*values, expected.*replica.*Command should output one delay value per replica" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: wrong count validation test not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: wrong count validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test5: Multiple -1 values ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_multi_down.sh << 'EOF'
+#!/bin/bash
+# Test multiple replicas down
+echo "-1 -1"
+EOF
+chmod +x delay_cmd_multi_down.sh
+
+sed -i.bak "s|delay_cmd_wrong_count.sh|delay_cmd_multi_down.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for multi-down test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Multiple down nodes detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for multiple -1 handling
+DOWN_COUNT=$(grep -c "node.*reported as down by external command.*delay -1" log/pgpool.log)
+if [ "$DOWN_COUNT" -lt 2 ]; then
+    echo fail: expected 2 down node messages, found $DOWN_COUNT
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: multiple -1 handling test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test6: Command timeout with different timeout values ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_timeout.sh << 'EOF'
+#!/bin/bash
+# Command that takes 5 seconds
+sleep 5
+echo "25 50"
+EOF
+chmod +x delay_cmd_timeout.sh
+
+# Test with timeout shorter than command duration
+sed -i.bak "s|delay_cmd_multi_down.sh|delay_cmd_timeout.sh|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 2" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for timeout
+echo "Waiting for timeout test (2s timeout, 5s command)..."
+for i in {1..10}; do
+    if grep -q "replication delay command timed out after 2 seconds" log/pgpool.log 2>/dev/null; then
+        echo "Timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout message
+grep "replication delay command timed out after 2 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: timeout test succeeded
+./shutdownall
+
+# Test with timeout longer than command duration
+sed -i.bak "s|replication_delay_source_timeout = 2|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for successful execution
+echo "Waiting for successful execution (10s timeout, 5s command)..."
+for i in {1..15}; do
+    if grep -q "executing replication delay command.*delay_cmd_timeout.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed successfully after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should not timeout this time
+if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not have timed out with 10s timeout
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: extended timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test7: Mix of valid delays and -1 ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_mixed.sh << 'EOF'
+#!/bin/bash
+# One replica up (25ms), one down (-1)
+echo "25 -1"
+EOF
+chmod +x delay_cmd_mixed.sh
+
+sed -i.bak "s|delay_cmd_timeout.sh|delay_cmd_mixed.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check
+echo "Waiting for mixed delay test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Mixed delay handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should log one -1 and process one normal delay
+grep "node.*reported as down by external command.*delay -1" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 not logged
+    ./shutdownall
+    exit 1
+fi
+
+# Should also log the normal replica delay
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo "Note: Normal replica delay logging may not be visible with log_standby_delay settings"
+fi
+
+echo ok: mixed delay handling test succeeded
+./shutdownall
+
+echo "All validation tests passed!"
+exit 0
\ No newline at end of file
-- 
2.52.0



^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-24 02:46  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  2 siblings, 0 replies; 61+ messages in thread

From: Tatsuo Ishii @ 2025-12-24 02:46 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

Thank you for the updated patch. Unfortunately it continues to
fail. Looking into the code and log, it seems sr_check_worker fails at
here:

		else if (WIFSIGNALED(status))
		{
			ereport(ERROR,
					(errmsg("replication delay command terminated by signal %d: %s",
							WTERMSIG(status), command)));
		}

A strange thing is, the the signal number looks random: 6, 9, 62.... I
have no idea why this could happen.  If I change ERROR to DEBUG1, the
test sometimes succeeds and sometimes fails.  I will look into this.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Hi Tatsuo,
> 
> Thank you for the note.
> 
> I've removed the docker stuff. started working in an ubuntu 24 VM to match
> the setup. hopefully the results will be better, had so many issues
> compiling and testing before that stuff wasn't properly formulated.
> 
> Attaching the latest patch.
> 
> this is what i'm seeing:
> adav@lima-dev:/src/pgpool2/src/test/regression$ PGHOST=/tmp ./regress.sh -p
> /usr/bin 041.external_replication_delay
> creating pgpool-II temporary installation ...
> moving pgpool_setup to temporary installation path ...
> moving watchdog_setup to temporary installation path ...
> using pgpool-II at /src/pgpool2/src/test/regression/temp/installed
> *************************
> REGRESSION MODE          : install
> Pgpool-II version        : pgpool-II version 4.8devel (mitsukakeboshi)
> Pgpool-II install path   : /src/pgpool2/src/test/regression/temp/installed
> PostgreSQL bin           : /usr/lib/postgresql/16/bin
> PostgreSQL Major version : 16
> pgbench                  : /usr/lib/postgresql/16/bin/pgbench
> PostgreSQL jdbc          :
> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
> *************************
> testing 041.external_replication_delay...ok.
> out of 1 ok:1 failed:0 timeout:0
> 
> 
> 
> On Tue, Dec 23, 2025 at 10:46 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> > Hi Tatsuo,
>> >
>> > I'km running into issues testing this and have created a full docker
>> > compose setup - can you please point me to up to date guides on the best
>> > way to run the tests so i know we're doing it the same way?
>> >
>> > Thank you for all your help!
>>
>> I have run the regression test on the Pgpool-II master branch on my
>> Ubuntu 24 box.
>>
>> cd pgpool2/src/test/regression
>> ./regress.sh 041
>>
>> This time I noticed:
>>
>> - The patch does not named with version number
>> - The patch creates .dockerignore and docker/ directory.
>>
>> Are they intended? I am asking because they are different from the
>> previous version.
>>
>> > On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> > I think everything is passing now. new version attached.
>> >>
>> >> Unfortunately Test1 did not pass.
>> >>
>> >> === Test1: Basic external command with integer millisecond values ===
>> >> waiting for server to start....1438600 2025-12-23 09:09:48.337 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will appear
>> >> in directory "log".
>> >>  done
>> >> server started
>> >> waiting for server to start....1438617 2025-12-23 09:09:48.443 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will appear
>> >> in directory "log".
>> >>  done
>> >> server started
>> >> waiting for server to start....1438634 2025-12-23 09:09:48.561 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will appear
>> >> in directory "log".
>> >>  done
>> >> server started
>> >> CREATE TABLE
>> >> Waiting for sr_check to run...
>> >> Command executed after 1 seconds
>> >>  node_id | hostname  | port  | status | pg_status | lb_weight |  role
>>  |
>> >> pg_role | select_cnt | load_balance_node | replication_delay |
>> >> replication_state | replication_sync_state | last_status_change
>> >>
>> >>
>> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>> >>  0       | localhost | 11002 | up     | up        | 0.333333  | primary
>> |
>> >> primary | 0          | true              | 0                 |
>> >>      |                        | 2025-12-23 09:09:49
>> >>  1       | localhost | 11003 | up     | up        | 0.333333  | standby
>> |
>> >> standby | 0          | false             | 0                 |
>> >>      |                        | 2025-12-23 09:09:49
>> >>  2       | localhost | 11004 | up     | up        | 0.333333  | standby
>> |
>> >> standby | 0          | false             | 0                 |
>> >>      |                        | 2025-12-23 09:09:49
>> >> (3 rows)
>> >>
>> >> fail: external command delay logging not found
>> >>
>> >> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <[email protected]>
>> >> wrote:
>> >> >
>> >> >> Thank you for updating the patch! This time the patch applies without
>> >> >> any issue and compiles fine. Unfortunately regression test failed.
>> >> >>
>> >> >> testing 041.external_replication_delay...failed.
>> >> >>
>> >> >> From the regression log, it seems Test7 failed.
>> >> >>
>> >> >>
>> >>
>> ------------------------------------------------------------------------------
>> >> >> === Test7: Command timeout handling ===
>> >> >> waiting for server to start....411181 2025-11-24 16:31:05.244 JST
>> LOG:
>> >> >> redirecting log output to logging collector process
>> >> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will
>> appear
>> >> in
>> >> >> directory "log".
>> >> >>  done
>> >> >> server started
>> >> >> waiting for server to start....411196 2025-11-24 16:31:05.352 JST
>> LOG:
>> >> >> redirecting log output to logging collector process
>> >> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will
>> appear
>> >> in
>> >> >> directory "log".
>> >> >>  done
>> >> >> server started
>> >> >> waiting for server to start....411213 2025-11-24 16:31:05.461 JST
>> LOG:
>> >> >> redirecting log output to logging collector process
>> >> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will
>> appear
>> >> in
>> >> >> directory "log".
>> >> >>  done
>> >> >> server started
>> >> >> Waiting for command timeout...
>> >> >> fail: command timeout not detected
>> >> >>
>> >> >>
>> >>
>> ------------------------------------------------------------------------------
>> >> >>
>> >> >> Attached is the pgpool.log. If you need more info, please let me
>> know.
>> >> >>
>> >> >> Best regards,
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS K.K.
>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> Japanese:http://www.sraoss.co.jp
>> >> >>
>> >> >>
>> >> >> > Hi Tatsuo,
>> >> >> >
>> >> >> > Sorry again, this was due to the separation of 2 patches and i only
>> >> sent
>> >> >> > the one.
>> >> >> >
>> >> >> > I've merged it into 1 commit and 1 patch and rebased over master to
>> >> avoid
>> >> >> > these issues moving forward.
>> >> >> >
>> >> >> > PFA latest version
>> >> >> >
>> >> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <[email protected]
>> >
>> >> >> wrote:
>> >> >> >
>> >> >> >> Hi Nadav,
>> >> >> >>
>> >> >> >> Thank you for new patch.
>> >> >> >> Unfortunately the patch did not apply to current master.
>> >> >> >>
>> >> >> >> $ git apply
>> >> >> >> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
>> >> >> >> error: patch failed:
>> >> src/streaming_replication/pool_worker_child.c:694
>> >> >> >> error: src/streaming_replication/pool_worker_child.c: patch does
>> not
>> >> >> apply
>> >> >> >>
>> >> >> >> Maybe the patch is on top of your previous patch?
>> >> >> >>
>> >> >> >> Also I suggest to use "-v" option of "git format-patch" to add the
>> >> >> >> patch version number so that we can easily know which patch is the
>> >> >> >> latest.
>> >> >> >>
>> >> >> >> Best regards,
>> >> >> >> --
>> >> >> >> Tatsuo Ishii
>> >> >> >> SRA OSS K.K.
>> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >>
>> >> >> >> > Hi Tatsuo,
>> >> >> >> >
>> >> >> >> > Please see attached an updated version.
>> >> >> >> >
>> >> >> >> > thank you
>> >> >> >> >
>> >> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <
>> [email protected]>
>> >> >> >> wrote:
>> >> >> >> >
>> >> >> >> >> > Sorry for that - thanks for the patch.
>> >> >> >> >> >
>> >> >> >> >> > Please find attached a new version
>> >> >> >> >>
>> >> >> >> >> Thanks for the new version. Unfortunately this time regression
>> >> test
>> >> >> >> >> fails at:
>> >> >> >> >>
>> >> >> >> >> > Waiting for command timeout...
>> >> >> >> >> > fail: command timeout not detected
>> >> >> >> >>
>> >> >> >> >> Attached is the pgpool.log.
>> >> >> >> >>
>> >> >> >> >> Best regards,
>> >> >> >> >> --
>> >> >> >> >> Tatsuo Ishii
>> >> >> >> >> SRA OSS K.K.
>> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >> >>
>> >> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
>> >> [email protected]>
>> >> >> >> >> wrote:
>> >> >> >> >> >
>> >> >> >> >> >> > thanks and sorry for the issues, please find attached
>> updated
>> >> >> >> version.
>> >> >> >> >> >>
>> >> >> >> >> >> No problem.
>> >> >> >> >> >>
>> >> >> >> >> >> This time the patch applies fine, no compiler warnings.
>> >> However,
>> >> >> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if
>> this
>> >> >> >> >> >> matters).  So I looked into
>> >> >> >> >> >>
>> >> src/test/regression/tests/041.external_replication_delay/test.sh a
>> >> >> >> >> >> little bit and apply attached patch (test.sh.patch). It
>> moved
>> >> >> forward
>> >> >> >> >> >> partially but failed at:
>> >> >> >> >> >>
>> >> >> >> >> >> fail: command execution failure not detected
>> >> >> >> >> >>
>> >> >> >> >> >> Please find attached
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>> >> >> >> >> >> and src/test/regression/log/041.external_replication_delay.
>> >> >> >> >> >>
>> >> >> >> >> >> Best regards,
>> >> >> >> >> >> --
>> >> >> >> >> >> Tatsuo Ishii
>> >> >> >> >> >> SRA OSS K.K.
>> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > --
>> >> >> >> >> > Nadav Shatz
>> >> >> >> >> > Tailor Brands | CTO
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Nadav Shatz
>> >> >> >> > Tailor Brands | CTO
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Nadav Shatz
>> >> >> > Tailor Brands | CTO
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Nadav Shatz
>> >> > Tailor Brands | CTO
>> >>
>> >
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-26 07:15  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  2 siblings, 0 replies; 61+ messages in thread

From: Tatsuo Ishii @ 2025-12-26 07:15 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

I think I found a cause of the problem. On Linux, if SIGCHLD is
ignored (set to SIG_IGN), waitpid() cannot get proper child status.
Because the kernel relcaims the resource for the child process to not
make the child process a zombie. And this makes waitpid() to fail with
ECHLD. Since the return of waitpid() is not checked, I did not notice
the waitpid() failure (I recommend to check the return value of
waitpid()).

	/* set up signal handlers */
	signal(SIGALRM, SIG_DFL);
	signal(SIGTERM, my_signal_handler);
	signal(SIGINT, my_signal_handler);
	signal(SIGHUP, reload_config_handler);
	signal(SIGQUIT, my_signal_handler);
	signal(SIGCHLD, SIG_IGN);	<--- SIGCHLD is ignored
	signal(SIGUSR1, my_signal_handler);
	signal(SIGUSR2, SIG_IGN);
	signal(SIGPIPE, SIG_IGN);

To fix this, either change the line above to:

	signal(SIGCHLD, SIG_DFL);
or
	signal(SIGCHLD, my_signal_handler);
	and modify my_signal_handler.

I recommend the latter, because it does not depend on the default
behavior of SIGCHLD, which might be different per platform.
Attached is the patch to do this. (and run pgindent).
I also notice that something like:

		/* Count tokens in output for validation */
		char *line_copy = pstrdup(line);
		char *temp_token = strtok(line_copy, " \t\n");

You should declare line_copy and temp_token in the begging of the code
block (or in the outer block).  The forward declaration is recommended
coding style in Pgpool-II (and PostgreSQL). Same thing can be said to
some other variables.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


> Hi Tatsuo,
> 
> Thank you for the note.
> 
> I've removed the docker stuff. started working in an ubuntu 24 VM to match
> the setup. hopefully the results will be better, had so many issues
> compiling and testing before that stuff wasn't properly formulated.
> 
> Attaching the latest patch.
> 
> this is what i'm seeing:
> adav@lima-dev:/src/pgpool2/src/test/regression$ PGHOST=/tmp ./regress.sh -p
> /usr/bin 041.external_replication_delay
> creating pgpool-II temporary installation ...
> moving pgpool_setup to temporary installation path ...
> moving watchdog_setup to temporary installation path ...
> using pgpool-II at /src/pgpool2/src/test/regression/temp/installed
> *************************
> REGRESSION MODE          : install
> Pgpool-II version        : pgpool-II version 4.8devel (mitsukakeboshi)
> Pgpool-II install path   : /src/pgpool2/src/test/regression/temp/installed
> PostgreSQL bin           : /usr/lib/postgresql/16/bin
> PostgreSQL Major version : 16
> pgbench                  : /usr/lib/postgresql/16/bin/pgbench
> PostgreSQL jdbc          :
> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
> *************************
> testing 041.external_replication_delay...ok.
> out of 1 ok:1 failed:0 timeout:0
> 
> 
> 
> On Tue, Dec 23, 2025 at 10:46 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> > Hi Tatsuo,
>> >
>> > I'km running into issues testing this and have created a full docker
>> > compose setup - can you please point me to up to date guides on the best
>> > way to run the tests so i know we're doing it the same way?
>> >
>> > Thank you for all your help!
>>
>> I have run the regression test on the Pgpool-II master branch on my
>> Ubuntu 24 box.
>>
>> cd pgpool2/src/test/regression
>> ./regress.sh 041
>>
>> This time I noticed:
>>
>> - The patch does not named with version number
>> - The patch creates .dockerignore and docker/ directory.
>>
>> Are they intended? I am asking because they are different from the
>> previous version.
>>
>> > On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> > I think everything is passing now. new version attached.
>> >>
>> >> Unfortunately Test1 did not pass.
>> >>
>> >> === Test1: Basic external command with integer millisecond values ===
>> >> waiting for server to start....1438600 2025-12-23 09:09:48.337 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will appear
>> >> in directory "log".
>> >>  done
>> >> server started
>> >> waiting for server to start....1438617 2025-12-23 09:09:48.443 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will appear
>> >> in directory "log".
>> >>  done
>> >> server started
>> >> waiting for server to start....1438634 2025-12-23 09:09:48.561 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will appear
>> >> in directory "log".
>> >>  done
>> >> server started
>> >> CREATE TABLE
>> >> Waiting for sr_check to run...
>> >> Command executed after 1 seconds
>> >>  node_id | hostname  | port  | status | pg_status | lb_weight |  role
>>  |
>> >> pg_role | select_cnt | load_balance_node | replication_delay |
>> >> replication_state | replication_sync_state | last_status_change
>> >>
>> >>
>> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>> >>  0       | localhost | 11002 | up     | up        | 0.333333  | primary
>> |
>> >> primary | 0          | true              | 0                 |
>> >>      |                        | 2025-12-23 09:09:49
>> >>  1       | localhost | 11003 | up     | up        | 0.333333  | standby
>> |
>> >> standby | 0          | false             | 0                 |
>> >>      |                        | 2025-12-23 09:09:49
>> >>  2       | localhost | 11004 | up     | up        | 0.333333  | standby
>> |
>> >> standby | 0          | false             | 0                 |
>> >>      |                        | 2025-12-23 09:09:49
>> >> (3 rows)
>> >>
>> >> fail: external command delay logging not found
>> >>
>> >> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <[email protected]>
>> >> wrote:
>> >> >
>> >> >> Thank you for updating the patch! This time the patch applies without
>> >> >> any issue and compiles fine. Unfortunately regression test failed.
>> >> >>
>> >> >> testing 041.external_replication_delay...failed.
>> >> >>
>> >> >> From the regression log, it seems Test7 failed.
>> >> >>
>> >> >>
>> >>
>> ------------------------------------------------------------------------------
>> >> >> === Test7: Command timeout handling ===
>> >> >> waiting for server to start....411181 2025-11-24 16:31:05.244 JST
>> LOG:
>> >> >> redirecting log output to logging collector process
>> >> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will
>> appear
>> >> in
>> >> >> directory "log".
>> >> >>  done
>> >> >> server started
>> >> >> waiting for server to start....411196 2025-11-24 16:31:05.352 JST
>> LOG:
>> >> >> redirecting log output to logging collector process
>> >> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will
>> appear
>> >> in
>> >> >> directory "log".
>> >> >>  done
>> >> >> server started
>> >> >> waiting for server to start....411213 2025-11-24 16:31:05.461 JST
>> LOG:
>> >> >> redirecting log output to logging collector process
>> >> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will
>> appear
>> >> in
>> >> >> directory "log".
>> >> >>  done
>> >> >> server started
>> >> >> Waiting for command timeout...
>> >> >> fail: command timeout not detected
>> >> >>
>> >> >>
>> >>
>> ------------------------------------------------------------------------------
>> >> >>
>> >> >> Attached is the pgpool.log. If you need more info, please let me
>> know.
>> >> >>
>> >> >> Best regards,
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS K.K.
>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> Japanese:http://www.sraoss.co.jp
>> >> >>
>> >> >>
>> >> >> > Hi Tatsuo,
>> >> >> >
>> >> >> > Sorry again, this was due to the separation of 2 patches and i only
>> >> sent
>> >> >> > the one.
>> >> >> >
>> >> >> > I've merged it into 1 commit and 1 patch and rebased over master to
>> >> avoid
>> >> >> > these issues moving forward.
>> >> >> >
>> >> >> > PFA latest version
>> >> >> >
>> >> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <[email protected]
>> >
>> >> >> wrote:
>> >> >> >
>> >> >> >> Hi Nadav,
>> >> >> >>
>> >> >> >> Thank you for new patch.
>> >> >> >> Unfortunately the patch did not apply to current master.
>> >> >> >>
>> >> >> >> $ git apply
>> >> >> >> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
>> >> >> >> error: patch failed:
>> >> src/streaming_replication/pool_worker_child.c:694
>> >> >> >> error: src/streaming_replication/pool_worker_child.c: patch does
>> not
>> >> >> apply
>> >> >> >>
>> >> >> >> Maybe the patch is on top of your previous patch?
>> >> >> >>
>> >> >> >> Also I suggest to use "-v" option of "git format-patch" to add the
>> >> >> >> patch version number so that we can easily know which patch is the
>> >> >> >> latest.
>> >> >> >>
>> >> >> >> Best regards,
>> >> >> >> --
>> >> >> >> Tatsuo Ishii
>> >> >> >> SRA OSS K.K.
>> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >>
>> >> >> >> > Hi Tatsuo,
>> >> >> >> >
>> >> >> >> > Please see attached an updated version.
>> >> >> >> >
>> >> >> >> > thank you
>> >> >> >> >
>> >> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <
>> [email protected]>
>> >> >> >> wrote:
>> >> >> >> >
>> >> >> >> >> > Sorry for that - thanks for the patch.
>> >> >> >> >> >
>> >> >> >> >> > Please find attached a new version
>> >> >> >> >>
>> >> >> >> >> Thanks for the new version. Unfortunately this time regression
>> >> test
>> >> >> >> >> fails at:
>> >> >> >> >>
>> >> >> >> >> > Waiting for command timeout...
>> >> >> >> >> > fail: command timeout not detected
>> >> >> >> >>
>> >> >> >> >> Attached is the pgpool.log.
>> >> >> >> >>
>> >> >> >> >> Best regards,
>> >> >> >> >> --
>> >> >> >> >> Tatsuo Ishii
>> >> >> >> >> SRA OSS K.K.
>> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >> >>
>> >> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
>> >> [email protected]>
>> >> >> >> >> wrote:
>> >> >> >> >> >
>> >> >> >> >> >> > thanks and sorry for the issues, please find attached
>> updated
>> >> >> >> version.
>> >> >> >> >> >>
>> >> >> >> >> >> No problem.
>> >> >> >> >> >>
>> >> >> >> >> >> This time the patch applies fine, no compiler warnings.
>> >> However,
>> >> >> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if
>> this
>> >> >> >> >> >> matters).  So I looked into
>> >> >> >> >> >>
>> >> src/test/regression/tests/041.external_replication_delay/test.sh a
>> >> >> >> >> >> little bit and apply attached patch (test.sh.patch). It
>> moved
>> >> >> forward
>> >> >> >> >> >> partially but failed at:
>> >> >> >> >> >>
>> >> >> >> >> >> fail: command execution failure not detected
>> >> >> >> >> >>
>> >> >> >> >> >> Please find attached
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>> >> >> >> >> >> and src/test/regression/log/041.external_replication_delay.
>> >> >> >> >> >>
>> >> >> >> >> >> Best regards,
>> >> >> >> >> >> --
>> >> >> >> >> >> Tatsuo Ishii
>> >> >> >> >> >> SRA OSS K.K.
>> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > --
>> >> >> >> >> > Nadav Shatz
>> >> >> >> >> > Tailor Brands | CTO
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Nadav Shatz
>> >> >> >> > Tailor Brands | CTO
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Nadav Shatz
>> >> >> > Tailor Brands | CTO
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Nadav Shatz
>> >> > Tailor Brands | CTO
>> >>
>> >
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


Attachments:

  [text/x-patch] extenal_delay_cmd.patch (8.2K, 2-extenal_delay_cmd.patch)
  download | inline diff:
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 457d0fab0..c509ba5bc 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -132,7 +132,7 @@ do_worker_child(void *params)
 	signal(SIGINT, my_signal_handler);
 	signal(SIGHUP, reload_config_handler);
 	signal(SIGQUIT, my_signal_handler);
-	signal(SIGCHLD, SIG_IGN);
+	signal(SIGCHLD, my_signal_handler);
 	signal(SIGUSR1, my_signal_handler);
 	signal(SIGUSR2, SIG_IGN);
 	signal(SIGPIPE, SIG_IGN);
@@ -262,16 +262,20 @@ do_worker_child(void *params)
 					POOL_NODE_STATUS *node_status;
 					int			i;
 
-			/* Do replication time lag checking */
-			/* Use external command if replication_delay_source_cmd is configured */
-			if (pool_config->replication_delay_source_cmd &&
-				strlen(pool_config->replication_delay_source_cmd) > 0)
-				check_replication_time_lag_with_cmd();
-			else
-				check_replication_time_lag();
+					/* Do replication time lag checking */
 
-			/* Check node status */
-			node_status = verify_backend_node_status(slots);
+					/*
+					 * Use external command if replication_delay_source_cmd is
+					 * configured
+					 */
+					if (pool_config->replication_delay_source_cmd &&
+						strlen(pool_config->replication_delay_source_cmd) > 0)
+						check_replication_time_lag_with_cmd();
+					else
+						check_replication_time_lag();
+
+					/* Check node status */
+					node_status = verify_backend_node_status(slots);
 
 
 					for (i = 0; i < NUM_BACKENDS; i++)
@@ -668,7 +672,7 @@ check_replication_time_lag(void)
 }
 
 #define MAX_CMD_OUTPUT 4096
-#define MAX_REASONABLE_DELAY_MS 3600000.0  /* 1 hour in milliseconds */
+#define MAX_REASONABLE_DELAY_MS 3600000.0	/* 1 hour in milliseconds */
 
 /*
  * Check replication time lag using external command
@@ -680,23 +684,23 @@ check_replication_time_lag(void)
 static void
 check_replication_time_lag_with_cmd(void)
 {
-	char		   *command = NULL;
-	char		   *line;
-	char		   *token;
-	char		   *saveptr;
-	double			delay_ms;
-	uint64			delay;
-	int				token_count = 0;
-	BackendInfo	   *bkinfo;
+	char	   *command = NULL;
+	char	   *line;
+	char	   *token;
+	char	   *saveptr;
+	double		delay_ms;
+	uint64		delay;
+	int			token_count = 0;
+	BackendInfo *bkinfo;
 	ErrorContextCallback callback;
-	int				pipefd[2] = {-1, -1};
-	pid_t			pid = -1;
-	int				ret;
-	struct timeval	timeout;
-	fd_set			readfds;
-	ssize_t			bytes_read;
-	int				status;
-	int				num_replicas;
+	int			pipefd[2] = {-1, -1};
+	pid_t		pid = -1;
+	int			ret;
+	struct timeval timeout;
+	fd_set		readfds;
+	ssize_t		bytes_read;
+	int			status;
+	int			num_replicas;
 
 	if (NUM_BACKENDS <= 1)
 	{
@@ -717,7 +721,7 @@ check_replication_time_lag_with_cmd(void)
 	}
 
 	/* Capture primary node ID to avoid race conditions during execution */
-	int primary_node_id = REAL_PRIMARY_NODE_ID;
+	int			primary_node_id = REAL_PRIMARY_NODE_ID;
 
 	if (!pool_config->replication_delay_source_cmd ||
 		strlen(pool_config->replication_delay_source_cmd) == 0)
@@ -746,16 +750,21 @@ check_replication_time_lag_with_cmd(void)
 	PG_TRY();
 	{
 		const char *base_command = pool_config->replication_delay_source_cmd;
-		size_t total_len = strlen(base_command) + 1; /* +1 for NUL */
+		size_t		total_len = strlen(base_command) + 1;	/* +1 for NUL */
 
 		/* Build command with replica-only arguments (omit primary) */
-		/* Calculate total command length including space-separated replica identifiers */
+
+		/*
+		 * Calculate total command length including space-separated replica
+		 * identifiers
+		 */
 		for (int i = 0; i < NUM_BACKENDS; i++)
 		{
 			if (i == primary_node_id)
-				continue; /* Skip primary node */
+				continue;		/* Skip primary node */
+
+			char	   *ident = build_instance_identifier_for_node(i);
 
-			char *ident = build_instance_identifier_for_node(i);
 			total_len += 1 /* space */ + strlen(ident);
 			pfree(ident);
 		}
@@ -764,13 +773,14 @@ check_replication_time_lag_with_cmd(void)
 		strlcpy(command, base_command, total_len);
 
 		/* Append replica identifiers */
-		size_t current_len = strlen(command);
+		size_t		current_len = strlen(command);
+
 		for (int i = 0; i < NUM_BACKENDS; i++)
 		{
 			if (i == primary_node_id)
-				continue; /* Skip primary node */
+				continue;		/* Skip primary node */
 
-			char *ident = build_instance_identifier_for_node(i);
+			char	   *ident = build_instance_identifier_for_node(i);
 
 			/* Append space and identifier */
 			snprintf(command + current_len, total_len - current_len, " %s", ident);
@@ -800,16 +810,16 @@ check_replication_time_lag_with_cmd(void)
 		if (pid == 0)
 		{
 			/* Child process */
-			close(pipefd[0]); /* Close read end */
+			close(pipefd[0]);	/* Close read end */
 			if (dup2(pipefd[1], STDOUT_FILENO) == -1)
 			{
 				fprintf(stderr, "dup2 failed: %s\n", strerror(errno));
 				exit(1);
 			}
-			close(pipefd[1]); /* Close write end (duplicated to stdout) */
+			close(pipefd[1]);	/* Close write end (duplicated to stdout) */
 
 			/* Execute command using shell */
-			execl("/bin/sh", "sh", "-c", command, (char *)NULL);
+			execl("/bin/sh", "sh", "-c", command, (char *) NULL);
 
 			/* If execl fails */
 			fprintf(stderr, "execl failed: %s\n", strerror(errno));
@@ -817,7 +827,7 @@ check_replication_time_lag_with_cmd(void)
 		}
 
 		/* Parent process */
-		close(pipefd[1]); /* Close write end */
+		close(pipefd[1]);		/* Close write end */
 		pipefd[1] = -1;
 
 		/* Set up timeout for select */
@@ -832,7 +842,8 @@ check_replication_time_lag_with_cmd(void)
 
 		if (ret == -1)
 		{
-			int save_errno = errno;
+			int			save_errno = errno;
+
 			kill(pid, SIGKILL);
 			waitpid(pid, NULL, 0);
 			pid = -1;
@@ -913,11 +924,12 @@ check_replication_time_lag_with_cmd(void)
 		bkinfo->standby_delay_by_time = true;
 
 		/* Count expected replicas */
-		num_replicas = NUM_BACKENDS - 1; /* Total nodes minus primary */
+		num_replicas = NUM_BACKENDS - 1;	/* Total nodes minus primary */
 
 		/* Count tokens in output for validation */
-		char *line_copy = pstrdup(line);
-		char *temp_token = strtok(line_copy, " \t\n");
+		char	   *line_copy = pstrdup(line);
+		char	   *temp_token = strtok(line_copy, " \t\n");
+
 		while (temp_token != NULL)
 		{
 			token_count++;
@@ -953,7 +965,7 @@ check_replication_time_lag_with_cmd(void)
 		for (int i = 0; i < NUM_BACKENDS && token != NULL; i++)
 		{
 			if (i == primary_node_id)
-				continue; /* Skip primary - it's not in the output */
+				continue;		/* Skip primary - it's not in the output */
 
 			if (!VALID_BACKEND(i))
 			{
@@ -962,7 +974,8 @@ check_replication_time_lag_with_cmd(void)
 				continue;
 			}
 
-			char *endptr;
+			char	   *endptr;
+
 			delay_ms = strtod(token, &endptr);
 
 			/* Validate the conversion */
@@ -1002,13 +1015,18 @@ check_replication_time_lag_with_cmd(void)
 								delay_ms, i)));
 			}
 
-			/* Convert delay from milliseconds to microseconds for internal storage */
-			delay = (uint64)(delay_ms * 1000);
+			/*
+			 * Convert delay from milliseconds to microseconds for internal
+			 * storage
+			 */
+			delay = (uint64) (delay_ms * 1000);
 			bkinfo->standby_delay = delay;
 			bkinfo->standby_delay_by_time = true;
 
 			/* Log delay if necessary */
-			uint64 delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in milliseconds, convert to microseconds */
+			uint64		delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000;	/* threshold is in
+																								 * milliseconds, convert
+																								 * to microseconds */
 
 			if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
 				(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
@@ -1026,12 +1044,15 @@ check_replication_time_lag_with_cmd(void)
 	PG_CATCH();
 	{
 		/* Cleanup in case of error */
-		if (pid > 0) {
+		if (pid > 0)
+		{
 			kill(pid, SIGKILL);
 			waitpid(pid, NULL, 0);
 		}
-		if (pipefd[0] != -1) close(pipefd[0]);
-		if (pipefd[1] != -1) close(pipefd[1]);
+		if (pipefd[0] != -1)
+			close(pipefd[0]);
+		if (pipefd[1] != -1)
+			close(pipefd[1]);
 
 		if (line)
 			pfree(line);
@@ -1137,6 +1158,9 @@ static RETSIGTYPE my_signal_handler(int sig)
 			restart_request = 1;
 			break;
 
+		case SIGCHLD:
+			break;
+
 		default:
 			exit(1);
 			break;


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-26 07:54  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  2 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-12-26 07:54 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

(Please disregard previous mail. I seem to have mangled the message).

I think I found a cause of the problem. On Linux, if SIGCHLD is
ignored (set to SIG_IGN), waitpid() cannot get proper child status.
Because the kernel relcaims the resource for the child process to not
make the child process a zombie. And this makes waitpid() to fail with
ECHLD. Since the return of waitpid() is not checked, I did not notice
the waitpid() failure (I recommend to check the return value of
waitpid()).

	/* set up signal handlers */
	signal(SIGALRM, SIG_DFL);
	signal(SIGTERM, my_signal_handler);
	signal(SIGINT, my_signal_handler);
	signal(SIGHUP, reload_config_handler);
	signal(SIGQUIT, my_signal_handler);
	signal(SIGCHLD, SIG_IGN);	<--- SIGCHLD is ignored
	signal(SIGUSR1, my_signal_handler);
	signal(SIGUSR2, SIG_IGN);
	signal(SIGPIPE, SIG_IGN);

To fix this, either change the line above to:

	signal(SIGCHLD, SIG_DFL);
or
	signal(SIGCHLD, my_signal_handler);
	and modify my_signal_handler.

I recommend the latter, because it does not depend on the default
behavior of SIGCHLD, which might be different per platform.
Attached is the patch to do this. (and run pgindent).
I also notice that something like:

		/* Count tokens in output for validation */
		char *line_copy = pstrdup(line);
		char *temp_token = strtok(line_copy, " \t\n");

You should declare line_copy and temp_token in the begging of the code
block (or in the outer block).  The forward declaration is recommended
coding style in Pgpool-II (and PostgreSQL). Same thing can be said to
some other variables.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Hi Tatsuo,
> 
> Thank you for the note.
> 
> I've removed the docker stuff. started working in an ubuntu 24 VM to match
> the setup. hopefully the results will be better, had so many issues
> compiling and testing before that stuff wasn't properly formulated.
> 
> Attaching the latest patch.
> 
> this is what i'm seeing:
> adav@lima-dev:/src/pgpool2/src/test/regression$ PGHOST=/tmp ./regress.sh -p
> /usr/bin 041.external_replication_delay
> creating pgpool-II temporary installation ...
> moving pgpool_setup to temporary installation path ...
> moving watchdog_setup to temporary installation path ...
> using pgpool-II at /src/pgpool2/src/test/regression/temp/installed
> *************************
> REGRESSION MODE          : install
> Pgpool-II version        : pgpool-II version 4.8devel (mitsukakeboshi)
> Pgpool-II install path   : /src/pgpool2/src/test/regression/temp/installed
> PostgreSQL bin           : /usr/lib/postgresql/16/bin
> PostgreSQL Major version : 16
> pgbench                  : /usr/lib/postgresql/16/bin/pgbench
> PostgreSQL jdbc          :
> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
> *************************
> testing 041.external_replication_delay...ok.
> out of 1 ok:1 failed:0 timeout:0
> 
> 
> 
> On Tue, Dec 23, 2025 at 10:46 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> > Hi Tatsuo,
>> >
>> > I'km running into issues testing this and have created a full docker
>> > compose setup - can you please point me to up to date guides on the best
>> > way to run the tests so i know we're doing it the same way?
>> >
>> > Thank you for all your help!
>>
>> I have run the regression test on the Pgpool-II master branch on my
>> Ubuntu 24 box.
>>
>> cd pgpool2/src/test/regression
>> ./regress.sh 041
>>
>> This time I noticed:
>>
>> - The patch does not named with version number
>> - The patch creates .dockerignore and docker/ directory.
>>
>> Are they intended? I am asking because they are different from the
>> previous version.
>>
>> > On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> > I think everything is passing now. new version attached.
>> >>
>> >> Unfortunately Test1 did not pass.
>> >>
>> >> === Test1: Basic external command with integer millisecond values ===
>> >> waiting for server to start....1438600 2025-12-23 09:09:48.337 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will appear
>> >> in directory "log".
>> >>  done
>> >> server started
>> >> waiting for server to start....1438617 2025-12-23 09:09:48.443 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will appear
>> >> in directory "log".
>> >>  done
>> >> server started
>> >> waiting for server to start....1438634 2025-12-23 09:09:48.561 JST LOG:
>> >> redirecting log output to logging collector process
>> >> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will appear
>> >> in directory "log".
>> >>  done
>> >> server started
>> >> CREATE TABLE
>> >> Waiting for sr_check to run...
>> >> Command executed after 1 seconds
>> >>  node_id | hostname  | port  | status | pg_status | lb_weight |  role
>>  |
>> >> pg_role | select_cnt | load_balance_node | replication_delay |
>> >> replication_state | replication_sync_state | last_status_change
>> >>
>> >>
>> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>> >>  0       | localhost | 11002 | up     | up        | 0.333333  | primary
>> |
>> >> primary | 0          | true              | 0                 |
>> >>      |                        | 2025-12-23 09:09:49
>> >>  1       | localhost | 11003 | up     | up        | 0.333333  | standby
>> |
>> >> standby | 0          | false             | 0                 |
>> >>      |                        | 2025-12-23 09:09:49
>> >>  2       | localhost | 11004 | up     | up        | 0.333333  | standby
>> |
>> >> standby | 0          | false             | 0                 |
>> >>      |                        | 2025-12-23 09:09:49
>> >> (3 rows)
>> >>
>> >> fail: external command delay logging not found
>> >>
>> >> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <[email protected]>
>> >> wrote:
>> >> >
>> >> >> Thank you for updating the patch! This time the patch applies without
>> >> >> any issue and compiles fine. Unfortunately regression test failed.
>> >> >>
>> >> >> testing 041.external_replication_delay...failed.
>> >> >>
>> >> >> From the regression log, it seems Test7 failed.
>> >> >>
>> >> >>
>> >>
>> ------------------------------------------------------------------------------
>> >> >> === Test7: Command timeout handling ===
>> >> >> waiting for server to start....411181 2025-11-24 16:31:05.244 JST
>> LOG:
>> >> >> redirecting log output to logging collector process
>> >> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will
>> appear
>> >> in
>> >> >> directory "log".
>> >> >>  done
>> >> >> server started
>> >> >> waiting for server to start....411196 2025-11-24 16:31:05.352 JST
>> LOG:
>> >> >> redirecting log output to logging collector process
>> >> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will
>> appear
>> >> in
>> >> >> directory "log".
>> >> >>  done
>> >> >> server started
>> >> >> waiting for server to start....411213 2025-11-24 16:31:05.461 JST
>> LOG:
>> >> >> redirecting log output to logging collector process
>> >> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will
>> appear
>> >> in
>> >> >> directory "log".
>> >> >>  done
>> >> >> server started
>> >> >> Waiting for command timeout...
>> >> >> fail: command timeout not detected
>> >> >>
>> >> >>
>> >>
>> ------------------------------------------------------------------------------
>> >> >>
>> >> >> Attached is the pgpool.log. If you need more info, please let me
>> know.
>> >> >>
>> >> >> Best regards,
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS K.K.
>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> Japanese:http://www.sraoss.co.jp
>> >> >>
>> >> >>
>> >> >> > Hi Tatsuo,
>> >> >> >
>> >> >> > Sorry again, this was due to the separation of 2 patches and i only
>> >> sent
>> >> >> > the one.
>> >> >> >
>> >> >> > I've merged it into 1 commit and 1 patch and rebased over master to
>> >> avoid
>> >> >> > these issues moving forward.
>> >> >> >
>> >> >> > PFA latest version
>> >> >> >
>> >> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <[email protected]
>> >
>> >> >> wrote:
>> >> >> >
>> >> >> >> Hi Nadav,
>> >> >> >>
>> >> >> >> Thank you for new patch.
>> >> >> >> Unfortunately the patch did not apply to current master.
>> >> >> >>
>> >> >> >> $ git apply
>> >> >> >> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
>> >> >> >> error: patch failed:
>> >> src/streaming_replication/pool_worker_child.c:694
>> >> >> >> error: src/streaming_replication/pool_worker_child.c: patch does
>> not
>> >> >> apply
>> >> >> >>
>> >> >> >> Maybe the patch is on top of your previous patch?
>> >> >> >>
>> >> >> >> Also I suggest to use "-v" option of "git format-patch" to add the
>> >> >> >> patch version number so that we can easily know which patch is the
>> >> >> >> latest.
>> >> >> >>
>> >> >> >> Best regards,
>> >> >> >> --
>> >> >> >> Tatsuo Ishii
>> >> >> >> SRA OSS K.K.
>> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >>
>> >> >> >> > Hi Tatsuo,
>> >> >> >> >
>> >> >> >> > Please see attached an updated version.
>> >> >> >> >
>> >> >> >> > thank you
>> >> >> >> >
>> >> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <
>> [email protected]>
>> >> >> >> wrote:
>> >> >> >> >
>> >> >> >> >> > Sorry for that - thanks for the patch.
>> >> >> >> >> >
>> >> >> >> >> > Please find attached a new version
>> >> >> >> >>
>> >> >> >> >> Thanks for the new version. Unfortunately this time regression
>> >> test
>> >> >> >> >> fails at:
>> >> >> >> >>
>> >> >> >> >> > Waiting for command timeout...
>> >> >> >> >> > fail: command timeout not detected
>> >> >> >> >>
>> >> >> >> >> Attached is the pgpool.log.
>> >> >> >> >>
>> >> >> >> >> Best regards,
>> >> >> >> >> --
>> >> >> >> >> Tatsuo Ishii
>> >> >> >> >> SRA OSS K.K.
>> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >> >>
>> >> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
>> >> [email protected]>
>> >> >> >> >> wrote:
>> >> >> >> >> >
>> >> >> >> >> >> > thanks and sorry for the issues, please find attached
>> updated
>> >> >> >> version.
>> >> >> >> >> >>
>> >> >> >> >> >> No problem.
>> >> >> >> >> >>
>> >> >> >> >> >> This time the patch applies fine, no compiler warnings.
>> >> However,
>> >> >> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if
>> this
>> >> >> >> >> >> matters).  So I looked into
>> >> >> >> >> >>
>> >> src/test/regression/tests/041.external_replication_delay/test.sh a
>> >> >> >> >> >> little bit and apply attached patch (test.sh.patch). It
>> moved
>> >> >> forward
>> >> >> >> >> >> partially but failed at:
>> >> >> >> >> >>
>> >> >> >> >> >> fail: command execution failure not detected
>> >> >> >> >> >>
>> >> >> >> >> >> Please find attached
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>> >> >> >> >> >> and src/test/regression/log/041.external_replication_delay.
>> >> >> >> >> >>
>> >> >> >> >> >> Best regards,
>> >> >> >> >> >> --
>> >> >> >> >> >> Tatsuo Ishii
>> >> >> >> >> >> SRA OSS K.K.
>> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > --
>> >> >> >> >> > Nadav Shatz
>> >> >> >> >> > Tailor Brands | CTO
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Nadav Shatz
>> >> >> >> > Tailor Brands | CTO
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Nadav Shatz
>> >> >> > Tailor Brands | CTO
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Nadav Shatz
>> >> > Tailor Brands | CTO
>> >>
>> >
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO


Attachments:

  [application/octet-stream] extenal_delay_cmd.patch (8.2K, 2-extenal_delay_cmd.patch)
  download | inline diff:
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 457d0fab0..c509ba5bc 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -132,7 +132,7 @@ do_worker_child(void *params)
 	signal(SIGINT, my_signal_handler);
 	signal(SIGHUP, reload_config_handler);
 	signal(SIGQUIT, my_signal_handler);
-	signal(SIGCHLD, SIG_IGN);
+	signal(SIGCHLD, my_signal_handler);
 	signal(SIGUSR1, my_signal_handler);
 	signal(SIGUSR2, SIG_IGN);
 	signal(SIGPIPE, SIG_IGN);
@@ -262,16 +262,20 @@ do_worker_child(void *params)
 					POOL_NODE_STATUS *node_status;
 					int			i;
 
-			/* Do replication time lag checking */
-			/* Use external command if replication_delay_source_cmd is configured */
-			if (pool_config->replication_delay_source_cmd &&
-				strlen(pool_config->replication_delay_source_cmd) > 0)
-				check_replication_time_lag_with_cmd();
-			else
-				check_replication_time_lag();
+					/* Do replication time lag checking */
 
-			/* Check node status */
-			node_status = verify_backend_node_status(slots);
+					/*
+					 * Use external command if replication_delay_source_cmd is
+					 * configured
+					 */
+					if (pool_config->replication_delay_source_cmd &&
+						strlen(pool_config->replication_delay_source_cmd) > 0)
+						check_replication_time_lag_with_cmd();
+					else
+						check_replication_time_lag();
+
+					/* Check node status */
+					node_status = verify_backend_node_status(slots);
 
 
 					for (i = 0; i < NUM_BACKENDS; i++)
@@ -668,7 +672,7 @@ check_replication_time_lag(void)
 }
 
 #define MAX_CMD_OUTPUT 4096
-#define MAX_REASONABLE_DELAY_MS 3600000.0  /* 1 hour in milliseconds */
+#define MAX_REASONABLE_DELAY_MS 3600000.0	/* 1 hour in milliseconds */
 
 /*
  * Check replication time lag using external command
@@ -680,23 +684,23 @@ check_replication_time_lag(void)
 static void
 check_replication_time_lag_with_cmd(void)
 {
-	char		   *command = NULL;
-	char		   *line;
-	char		   *token;
-	char		   *saveptr;
-	double			delay_ms;
-	uint64			delay;
-	int				token_count = 0;
-	BackendInfo	   *bkinfo;
+	char	   *command = NULL;
+	char	   *line;
+	char	   *token;
+	char	   *saveptr;
+	double		delay_ms;
+	uint64		delay;
+	int			token_count = 0;
+	BackendInfo *bkinfo;
 	ErrorContextCallback callback;
-	int				pipefd[2] = {-1, -1};
-	pid_t			pid = -1;
-	int				ret;
-	struct timeval	timeout;
-	fd_set			readfds;
-	ssize_t			bytes_read;
-	int				status;
-	int				num_replicas;
+	int			pipefd[2] = {-1, -1};
+	pid_t		pid = -1;
+	int			ret;
+	struct timeval timeout;
+	fd_set		readfds;
+	ssize_t		bytes_read;
+	int			status;
+	int			num_replicas;
 
 	if (NUM_BACKENDS <= 1)
 	{
@@ -717,7 +721,7 @@ check_replication_time_lag_with_cmd(void)
 	}
 
 	/* Capture primary node ID to avoid race conditions during execution */
-	int primary_node_id = REAL_PRIMARY_NODE_ID;
+	int			primary_node_id = REAL_PRIMARY_NODE_ID;
 
 	if (!pool_config->replication_delay_source_cmd ||
 		strlen(pool_config->replication_delay_source_cmd) == 0)
@@ -746,16 +750,21 @@ check_replication_time_lag_with_cmd(void)
 	PG_TRY();
 	{
 		const char *base_command = pool_config->replication_delay_source_cmd;
-		size_t total_len = strlen(base_command) + 1; /* +1 for NUL */
+		size_t		total_len = strlen(base_command) + 1;	/* +1 for NUL */
 
 		/* Build command with replica-only arguments (omit primary) */
-		/* Calculate total command length including space-separated replica identifiers */
+
+		/*
+		 * Calculate total command length including space-separated replica
+		 * identifiers
+		 */
 		for (int i = 0; i < NUM_BACKENDS; i++)
 		{
 			if (i == primary_node_id)
-				continue; /* Skip primary node */
+				continue;		/* Skip primary node */
+
+			char	   *ident = build_instance_identifier_for_node(i);
 
-			char *ident = build_instance_identifier_for_node(i);
 			total_len += 1 /* space */ + strlen(ident);
 			pfree(ident);
 		}
@@ -764,13 +773,14 @@ check_replication_time_lag_with_cmd(void)
 		strlcpy(command, base_command, total_len);
 
 		/* Append replica identifiers */
-		size_t current_len = strlen(command);
+		size_t		current_len = strlen(command);
+
 		for (int i = 0; i < NUM_BACKENDS; i++)
 		{
 			if (i == primary_node_id)
-				continue; /* Skip primary node */
+				continue;		/* Skip primary node */
 
-			char *ident = build_instance_identifier_for_node(i);
+			char	   *ident = build_instance_identifier_for_node(i);
 
 			/* Append space and identifier */
 			snprintf(command + current_len, total_len - current_len, " %s", ident);
@@ -800,16 +810,16 @@ check_replication_time_lag_with_cmd(void)
 		if (pid == 0)
 		{
 			/* Child process */
-			close(pipefd[0]); /* Close read end */
+			close(pipefd[0]);	/* Close read end */
 			if (dup2(pipefd[1], STDOUT_FILENO) == -1)
 			{
 				fprintf(stderr, "dup2 failed: %s\n", strerror(errno));
 				exit(1);
 			}
-			close(pipefd[1]); /* Close write end (duplicated to stdout) */
+			close(pipefd[1]);	/* Close write end (duplicated to stdout) */
 
 			/* Execute command using shell */
-			execl("/bin/sh", "sh", "-c", command, (char *)NULL);
+			execl("/bin/sh", "sh", "-c", command, (char *) NULL);
 
 			/* If execl fails */
 			fprintf(stderr, "execl failed: %s\n", strerror(errno));
@@ -817,7 +827,7 @@ check_replication_time_lag_with_cmd(void)
 		}
 
 		/* Parent process */
-		close(pipefd[1]); /* Close write end */
+		close(pipefd[1]);		/* Close write end */
 		pipefd[1] = -1;
 
 		/* Set up timeout for select */
@@ -832,7 +842,8 @@ check_replication_time_lag_with_cmd(void)
 
 		if (ret == -1)
 		{
-			int save_errno = errno;
+			int			save_errno = errno;
+
 			kill(pid, SIGKILL);
 			waitpid(pid, NULL, 0);
 			pid = -1;
@@ -913,11 +924,12 @@ check_replication_time_lag_with_cmd(void)
 		bkinfo->standby_delay_by_time = true;
 
 		/* Count expected replicas */
-		num_replicas = NUM_BACKENDS - 1; /* Total nodes minus primary */
+		num_replicas = NUM_BACKENDS - 1;	/* Total nodes minus primary */
 
 		/* Count tokens in output for validation */
-		char *line_copy = pstrdup(line);
-		char *temp_token = strtok(line_copy, " \t\n");
+		char	   *line_copy = pstrdup(line);
+		char	   *temp_token = strtok(line_copy, " \t\n");
+
 		while (temp_token != NULL)
 		{
 			token_count++;
@@ -953,7 +965,7 @@ check_replication_time_lag_with_cmd(void)
 		for (int i = 0; i < NUM_BACKENDS && token != NULL; i++)
 		{
 			if (i == primary_node_id)
-				continue; /* Skip primary - it's not in the output */
+				continue;		/* Skip primary - it's not in the output */
 
 			if (!VALID_BACKEND(i))
 			{
@@ -962,7 +974,8 @@ check_replication_time_lag_with_cmd(void)
 				continue;
 			}
 
-			char *endptr;
+			char	   *endptr;
+
 			delay_ms = strtod(token, &endptr);
 
 			/* Validate the conversion */
@@ -1002,13 +1015,18 @@ check_replication_time_lag_with_cmd(void)
 								delay_ms, i)));
 			}
 
-			/* Convert delay from milliseconds to microseconds for internal storage */
-			delay = (uint64)(delay_ms * 1000);
+			/*
+			 * Convert delay from milliseconds to microseconds for internal
+			 * storage
+			 */
+			delay = (uint64) (delay_ms * 1000);
 			bkinfo->standby_delay = delay;
 			bkinfo->standby_delay_by_time = true;
 
 			/* Log delay if necessary */
-			uint64 delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in milliseconds, convert to microseconds */
+			uint64		delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000;	/* threshold is in
+																								 * milliseconds, convert
+																								 * to microseconds */
 
 			if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
 				(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
@@ -1026,12 +1044,15 @@ check_replication_time_lag_with_cmd(void)
 	PG_CATCH();
 	{
 		/* Cleanup in case of error */
-		if (pid > 0) {
+		if (pid > 0)
+		{
 			kill(pid, SIGKILL);
 			waitpid(pid, NULL, 0);
 		}
-		if (pipefd[0] != -1) close(pipefd[0]);
-		if (pipefd[1] != -1) close(pipefd[1]);
+		if (pipefd[0] != -1)
+			close(pipefd[0]);
+		if (pipefd[1] != -1)
+			close(pipefd[1]);
 
 		if (line)
 			pfree(line);
@@ -1137,6 +1158,9 @@ static RETSIGTYPE my_signal_handler(int sig)
 			restart_request = 1;
 			break;
 
+		case SIGCHLD:
+			break;
+
 		default:
 			exit(1);
 			break;


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-26 10:03  Tatsuo Ishii <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-12-26 10:03 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

I just want to make it clear. The patch should be applied on top of
your latest.patch.

> (Please disregard previous mail. I seem to have mangled the message).
> 
> I think I found a cause of the problem. On Linux, if SIGCHLD is
> ignored (set to SIG_IGN), waitpid() cannot get proper child status.
> Because the kernel relcaims the resource for the child process to not
> make the child process a zombie. And this makes waitpid() to fail with
> ECHLD. Since the return of waitpid() is not checked, I did not notice
> the waitpid() failure (I recommend to check the return value of
> waitpid()).
> 
> 	/* set up signal handlers */
> 	signal(SIGALRM, SIG_DFL);
> 	signal(SIGTERM, my_signal_handler);
> 	signal(SIGINT, my_signal_handler);
> 	signal(SIGHUP, reload_config_handler);
> 	signal(SIGQUIT, my_signal_handler);
> 	signal(SIGCHLD, SIG_IGN);	<--- SIGCHLD is ignored
> 	signal(SIGUSR1, my_signal_handler);
> 	signal(SIGUSR2, SIG_IGN);
> 	signal(SIGPIPE, SIG_IGN);
> 
> To fix this, either change the line above to:
> 
> 	signal(SIGCHLD, SIG_DFL);
> or
> 	signal(SIGCHLD, my_signal_handler);
> 	and modify my_signal_handler.
> 
> I recommend the latter, because it does not depend on the default
> behavior of SIGCHLD, which might be different per platform.
> Attached is the patch to do this. (and run pgindent).
> I also notice that something like:
> 
> 		/* Count tokens in output for validation */
> 		char *line_copy = pstrdup(line);
> 		char *temp_token = strtok(line_copy, " \t\n");
> 
> You should declare line_copy and temp_token in the begging of the code
> block (or in the outer block).  The forward declaration is recommended
> coding style in Pgpool-II (and PostgreSQL). Same thing can be said to
> some other variables.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
> 
>> Hi Tatsuo,
>> 
>> Thank you for the note.
>> 
>> I've removed the docker stuff. started working in an ubuntu 24 VM to match
>> the setup. hopefully the results will be better, had so many issues
>> compiling and testing before that stuff wasn't properly formulated.
>> 
>> Attaching the latest patch.
>> 
>> this is what i'm seeing:
>> adav@lima-dev:/src/pgpool2/src/test/regression$ PGHOST=/tmp ./regress.sh -p
>> /usr/bin 041.external_replication_delay
>> creating pgpool-II temporary installation ...
>> moving pgpool_setup to temporary installation path ...
>> moving watchdog_setup to temporary installation path ...
>> using pgpool-II at /src/pgpool2/src/test/regression/temp/installed
>> *************************
>> REGRESSION MODE          : install
>> Pgpool-II version        : pgpool-II version 4.8devel (mitsukakeboshi)
>> Pgpool-II install path   : /src/pgpool2/src/test/regression/temp/installed
>> PostgreSQL bin           : /usr/lib/postgresql/16/bin
>> PostgreSQL Major version : 16
>> pgbench                  : /usr/lib/postgresql/16/bin/pgbench
>> PostgreSQL jdbc          :
>> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
>> *************************
>> testing 041.external_replication_delay...ok.
>> out of 1 ok:1 failed:0 timeout:0
>> 
>> 
>> 
>> On Tue, Dec 23, 2025 at 10:46 AM Tatsuo Ishii <[email protected]> wrote:
>> 
>>> > Hi Tatsuo,
>>> >
>>> > I'km running into issues testing this and have created a full docker
>>> > compose setup - can you please point me to up to date guides on the best
>>> > way to run the tests so i know we're doing it the same way?
>>> >
>>> > Thank you for all your help!
>>>
>>> I have run the regression test on the Pgpool-II master branch on my
>>> Ubuntu 24 box.
>>>
>>> cd pgpool2/src/test/regression
>>> ./regress.sh 041
>>>
>>> This time I noticed:
>>>
>>> - The patch does not named with version number
>>> - The patch creates .dockerignore and docker/ directory.
>>>
>>> Are they intended? I am asking because they are different from the
>>> previous version.
>>>
>>> > On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <[email protected]>
>>> wrote:
>>> >
>>> >> > I think everything is passing now. new version attached.
>>> >>
>>> >> Unfortunately Test1 did not pass.
>>> >>
>>> >> === Test1: Basic external command with integer millisecond values ===
>>> >> waiting for server to start....1438600 2025-12-23 09:09:48.337 JST LOG:
>>> >> redirecting log output to logging collector process
>>> >> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will appear
>>> >> in directory "log".
>>> >>  done
>>> >> server started
>>> >> waiting for server to start....1438617 2025-12-23 09:09:48.443 JST LOG:
>>> >> redirecting log output to logging collector process
>>> >> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will appear
>>> >> in directory "log".
>>> >>  done
>>> >> server started
>>> >> waiting for server to start....1438634 2025-12-23 09:09:48.561 JST LOG:
>>> >> redirecting log output to logging collector process
>>> >> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will appear
>>> >> in directory "log".
>>> >>  done
>>> >> server started
>>> >> CREATE TABLE
>>> >> Waiting for sr_check to run...
>>> >> Command executed after 1 seconds
>>> >>  node_id | hostname  | port  | status | pg_status | lb_weight |  role
>>>  |
>>> >> pg_role | select_cnt | load_balance_node | replication_delay |
>>> >> replication_state | replication_sync_state | last_status_change
>>> >>
>>> >>
>>> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>>> >>  0       | localhost | 11002 | up     | up        | 0.333333  | primary
>>> |
>>> >> primary | 0          | true              | 0                 |
>>> >>      |                        | 2025-12-23 09:09:49
>>> >>  1       | localhost | 11003 | up     | up        | 0.333333  | standby
>>> |
>>> >> standby | 0          | false             | 0                 |
>>> >>      |                        | 2025-12-23 09:09:49
>>> >>  2       | localhost | 11004 | up     | up        | 0.333333  | standby
>>> |
>>> >> standby | 0          | false             | 0                 |
>>> >>      |                        | 2025-12-23 09:09:49
>>> >> (3 rows)
>>> >>
>>> >> fail: external command delay logging not found
>>> >>
>>> >> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <[email protected]>
>>> >> wrote:
>>> >> >
>>> >> >> Thank you for updating the patch! This time the patch applies without
>>> >> >> any issue and compiles fine. Unfortunately regression test failed.
>>> >> >>
>>> >> >> testing 041.external_replication_delay...failed.
>>> >> >>
>>> >> >> From the regression log, it seems Test7 failed.
>>> >> >>
>>> >> >>
>>> >>
>>> ------------------------------------------------------------------------------
>>> >> >> === Test7: Command timeout handling ===
>>> >> >> waiting for server to start....411181 2025-11-24 16:31:05.244 JST
>>> LOG:
>>> >> >> redirecting log output to logging collector process
>>> >> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will
>>> appear
>>> >> in
>>> >> >> directory "log".
>>> >> >>  done
>>> >> >> server started
>>> >> >> waiting for server to start....411196 2025-11-24 16:31:05.352 JST
>>> LOG:
>>> >> >> redirecting log output to logging collector process
>>> >> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will
>>> appear
>>> >> in
>>> >> >> directory "log".
>>> >> >>  done
>>> >> >> server started
>>> >> >> waiting for server to start....411213 2025-11-24 16:31:05.461 JST
>>> LOG:
>>> >> >> redirecting log output to logging collector process
>>> >> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will
>>> appear
>>> >> in
>>> >> >> directory "log".
>>> >> >>  done
>>> >> >> server started
>>> >> >> Waiting for command timeout...
>>> >> >> fail: command timeout not detected
>>> >> >>
>>> >> >>
>>> >>
>>> ------------------------------------------------------------------------------
>>> >> >>
>>> >> >> Attached is the pgpool.log. If you need more info, please let me
>>> know.
>>> >> >>
>>> >> >> Best regards,
>>> >> >> --
>>> >> >> Tatsuo Ishii
>>> >> >> SRA OSS K.K.
>>> >> >> English: http://www.sraoss.co.jp/index_en/
>>> >> >> Japanese:http://www.sraoss.co.jp
>>> >> >>
>>> >> >>
>>> >> >> > Hi Tatsuo,
>>> >> >> >
>>> >> >> > Sorry again, this was due to the separation of 2 patches and i only
>>> >> sent
>>> >> >> > the one.
>>> >> >> >
>>> >> >> > I've merged it into 1 commit and 1 patch and rebased over master to
>>> >> avoid
>>> >> >> > these issues moving forward.
>>> >> >> >
>>> >> >> > PFA latest version
>>> >> >> >
>>> >> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <[email protected]
>>> >
>>> >> >> wrote:
>>> >> >> >
>>> >> >> >> Hi Nadav,
>>> >> >> >>
>>> >> >> >> Thank you for new patch.
>>> >> >> >> Unfortunately the patch did not apply to current master.
>>> >> >> >>
>>> >> >> >> $ git apply
>>> >> >> >> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
>>> >> >> >> error: patch failed:
>>> >> src/streaming_replication/pool_worker_child.c:694
>>> >> >> >> error: src/streaming_replication/pool_worker_child.c: patch does
>>> not
>>> >> >> apply
>>> >> >> >>
>>> >> >> >> Maybe the patch is on top of your previous patch?
>>> >> >> >>
>>> >> >> >> Also I suggest to use "-v" option of "git format-patch" to add the
>>> >> >> >> patch version number so that we can easily know which patch is the
>>> >> >> >> latest.
>>> >> >> >>
>>> >> >> >> Best regards,
>>> >> >> >> --
>>> >> >> >> Tatsuo Ishii
>>> >> >> >> SRA OSS K.K.
>>> >> >> >> English: http://www.sraoss.co.jp/index_en/
>>> >> >> >> Japanese:http://www.sraoss.co.jp
>>> >> >> >>
>>> >> >> >> > Hi Tatsuo,
>>> >> >> >> >
>>> >> >> >> > Please see attached an updated version.
>>> >> >> >> >
>>> >> >> >> > thank you
>>> >> >> >> >
>>> >> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <
>>> [email protected]>
>>> >> >> >> wrote:
>>> >> >> >> >
>>> >> >> >> >> > Sorry for that - thanks for the patch.
>>> >> >> >> >> >
>>> >> >> >> >> > Please find attached a new version
>>> >> >> >> >>
>>> >> >> >> >> Thanks for the new version. Unfortunately this time regression
>>> >> test
>>> >> >> >> >> fails at:
>>> >> >> >> >>
>>> >> >> >> >> > Waiting for command timeout...
>>> >> >> >> >> > fail: command timeout not detected
>>> >> >> >> >>
>>> >> >> >> >> Attached is the pgpool.log.
>>> >> >> >> >>
>>> >> >> >> >> Best regards,
>>> >> >> >> >> --
>>> >> >> >> >> Tatsuo Ishii
>>> >> >> >> >> SRA OSS K.K.
>>> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>>> >> >> >> >> Japanese:http://www.sraoss.co.jp
>>> >> >> >> >>
>>> >> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
>>> >> [email protected]>
>>> >> >> >> >> wrote:
>>> >> >> >> >> >
>>> >> >> >> >> >> > thanks and sorry for the issues, please find attached
>>> updated
>>> >> >> >> version.
>>> >> >> >> >> >>
>>> >> >> >> >> >> No problem.
>>> >> >> >> >> >>
>>> >> >> >> >> >> This time the patch applies fine, no compiler warnings.
>>> >> However,
>>> >> >> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if
>>> this
>>> >> >> >> >> >> matters).  So I looked into
>>> >> >> >> >> >>
>>> >> src/test/regression/tests/041.external_replication_delay/test.sh a
>>> >> >> >> >> >> little bit and apply attached patch (test.sh.patch). It
>>> moved
>>> >> >> forward
>>> >> >> >> >> >> partially but failed at:
>>> >> >> >> >> >>
>>> >> >> >> >> >> fail: command execution failure not detected
>>> >> >> >> >> >>
>>> >> >> >> >> >> Please find attached
>>> >> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>>> >> >> >> >> >> and src/test/regression/log/041.external_replication_delay.
>>> >> >> >> >> >>
>>> >> >> >> >> >> Best regards,
>>> >> >> >> >> >> --
>>> >> >> >> >> >> Tatsuo Ishii
>>> >> >> >> >> >> SRA OSS K.K.
>>> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>>> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
>>> >> >> >> >> >>
>>> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >> > --
>>> >> >> >> >> > Nadav Shatz
>>> >> >> >> >> > Tailor Brands | CTO
>>> >> >> >> >>
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > --
>>> >> >> >> > Nadav Shatz
>>> >> >> >> > Tailor Brands | CTO
>>> >> >> >>
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > Nadav Shatz
>>> >> >> > Tailor Brands | CTO
>>> >> >>
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Nadav Shatz
>>> >> > Tailor Brands | CTO
>>> >>
>>> >
>>> >
>>> > --
>>> > Nadav Shatz
>>> > Tailor Brands | CTO
>>>
>> 
>> 
>> -- 
>> Nadav Shatz
>> Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-28 12:21  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-12-28 12:21 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

thank you! works for me. should we merge both into master or do you want me
to send a combined one?

On Fri, Dec 26, 2025 at 12:03 PM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> I just want to make it clear. The patch should be applied on top of
> your latest.patch.
>
> > (Please disregard previous mail. I seem to have mangled the message).
> >
> > I think I found a cause of the problem. On Linux, if SIGCHLD is
> > ignored (set to SIG_IGN), waitpid() cannot get proper child status.
> > Because the kernel relcaims the resource for the child process to not
> > make the child process a zombie. And this makes waitpid() to fail with
> > ECHLD. Since the return of waitpid() is not checked, I did not notice
> > the waitpid() failure (I recommend to check the return value of
> > waitpid()).
> >
> >       /* set up signal handlers */
> >       signal(SIGALRM, SIG_DFL);
> >       signal(SIGTERM, my_signal_handler);
> >       signal(SIGINT, my_signal_handler);
> >       signal(SIGHUP, reload_config_handler);
> >       signal(SIGQUIT, my_signal_handler);
> >       signal(SIGCHLD, SIG_IGN);       <--- SIGCHLD is ignored
> >       signal(SIGUSR1, my_signal_handler);
> >       signal(SIGUSR2, SIG_IGN);
> >       signal(SIGPIPE, SIG_IGN);
> >
> > To fix this, either change the line above to:
> >
> >       signal(SIGCHLD, SIG_DFL);
> > or
> >       signal(SIGCHLD, my_signal_handler);
> >       and modify my_signal_handler.
> >
> > I recommend the latter, because it does not depend on the default
> > behavior of SIGCHLD, which might be different per platform.
> > Attached is the patch to do this. (and run pgindent).
> > I also notice that something like:
> >
> >               /* Count tokens in output for validation */
> >               char *line_copy = pstrdup(line);
> >               char *temp_token = strtok(line_copy, " \t\n");
> >
> > You should declare line_copy and temp_token in the begging of the code
> > block (or in the outer block).  The forward declaration is recommended
> > coding style in Pgpool-II (and PostgreSQL). Same thing can be said to
> > some other variables.
> >
> > Best regards,
> > --
> > Tatsuo Ishii
> > SRA OSS K.K.
> > English: http://www.sraoss.co.jp/index_en/
> > Japanese:http://www.sraoss.co.jp
> >
> >> Hi Tatsuo,
> >>
> >> Thank you for the note.
> >>
> >> I've removed the docker stuff. started working in an ubuntu 24 VM to
> match
> >> the setup. hopefully the results will be better, had so many issues
> >> compiling and testing before that stuff wasn't properly formulated.
> >>
> >> Attaching the latest patch.
> >>
> >> this is what i'm seeing:
> >> adav@lima-dev:/src/pgpool2/src/test/regression$ PGHOST=/tmp
> ./regress.sh -p
> >> /usr/bin 041.external_replication_delay
> >> creating pgpool-II temporary installation ...
> >> moving pgpool_setup to temporary installation path ...
> >> moving watchdog_setup to temporary installation path ...
> >> using pgpool-II at /src/pgpool2/src/test/regression/temp/installed
> >> *************************
> >> REGRESSION MODE          : install
> >> Pgpool-II version        : pgpool-II version 4.8devel (mitsukakeboshi)
> >> Pgpool-II install path   :
> /src/pgpool2/src/test/regression/temp/installed
> >> PostgreSQL bin           : /usr/lib/postgresql/16/bin
> >> PostgreSQL Major version : 16
> >> pgbench                  : /usr/lib/postgresql/16/bin/pgbench
> >> PostgreSQL jdbc          :
> >> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
> >> *************************
> >> testing 041.external_replication_delay...ok.
> >> out of 1 ok:1 failed:0 timeout:0
> >>
> >>
> >>
> >> On Tue, Dec 23, 2025 at 10:46 AM Tatsuo Ishii <[email protected]>
> wrote:
> >>
> >>> > Hi Tatsuo,
> >>> >
> >>> > I'km running into issues testing this and have created a full docker
> >>> > compose setup - can you please point me to up to date guides on the
> best
> >>> > way to run the tests so i know we're doing it the same way?
> >>> >
> >>> > Thank you for all your help!
> >>>
> >>> I have run the regression test on the Pgpool-II master branch on my
> >>> Ubuntu 24 box.
> >>>
> >>> cd pgpool2/src/test/regression
> >>> ./regress.sh 041
> >>>
> >>> This time I noticed:
> >>>
> >>> - The patch does not named with version number
> >>> - The patch creates .dockerignore and docker/ directory.
> >>>
> >>> Are they intended? I am asking because they are different from the
> >>> previous version.
> >>>
> >>> > On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <[email protected]>
> >>> wrote:
> >>> >
> >>> >> > I think everything is passing now. new version attached.
> >>> >>
> >>> >> Unfortunately Test1 did not pass.
> >>> >>
> >>> >> === Test1: Basic external command with integer millisecond values
> ===
> >>> >> waiting for server to start....1438600 2025-12-23 09:09:48.337 JST
> LOG:
> >>> >> redirecting log output to logging collector process
> >>> >> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will
> appear
> >>> >> in directory "log".
> >>> >>  done
> >>> >> server started
> >>> >> waiting for server to start....1438617 2025-12-23 09:09:48.443 JST
> LOG:
> >>> >> redirecting log output to logging collector process
> >>> >> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will
> appear
> >>> >> in directory "log".
> >>> >>  done
> >>> >> server started
> >>> >> waiting for server to start....1438634 2025-12-23 09:09:48.561 JST
> LOG:
> >>> >> redirecting log output to logging collector process
> >>> >> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will
> appear
> >>> >> in directory "log".
> >>> >>  done
> >>> >> server started
> >>> >> CREATE TABLE
> >>> >> Waiting for sr_check to run...
> >>> >> Command executed after 1 seconds
> >>> >>  node_id | hostname  | port  | status | pg_status | lb_weight |
> role
> >>>  |
> >>> >> pg_role | select_cnt | load_balance_node | replication_delay |
> >>> >> replication_state | replication_sync_state | last_status_change
> >>> >>
> >>> >>
> >>>
> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
> >>> >>  0       | localhost | 11002 | up     | up        | 0.333333  |
> primary
> >>> |
> >>> >> primary | 0          | true              | 0                 |
> >>> >>      |                        | 2025-12-23 09:09:49
> >>> >>  1       | localhost | 11003 | up     | up        | 0.333333  |
> standby
> >>> |
> >>> >> standby | 0          | false             | 0                 |
> >>> >>      |                        | 2025-12-23 09:09:49
> >>> >>  2       | localhost | 11004 | up     | up        | 0.333333  |
> standby
> >>> |
> >>> >> standby | 0          | false             | 0                 |
> >>> >>      |                        | 2025-12-23 09:09:49
> >>> >> (3 rows)
> >>> >>
> >>> >> fail: external command delay logging not found
> >>> >>
> >>> >> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <
> [email protected]>
> >>> >> wrote:
> >>> >> >
> >>> >> >> Thank you for updating the patch! This time the patch applies
> without
> >>> >> >> any issue and compiles fine. Unfortunately regression test
> failed.
> >>> >> >>
> >>> >> >> testing 041.external_replication_delay...failed.
> >>> >> >>
> >>> >> >> From the regression log, it seems Test7 failed.
> >>> >> >>
> >>> >> >>
> >>> >>
> >>>
> ------------------------------------------------------------------------------
> >>> >> >> === Test7: Command timeout handling ===
> >>> >> >> waiting for server to start....411181 2025-11-24 16:31:05.244 JST
> >>> LOG:
> >>> >> >> redirecting log output to logging collector process
> >>> >> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will
> >>> appear
> >>> >> in
> >>> >> >> directory "log".
> >>> >> >>  done
> >>> >> >> server started
> >>> >> >> waiting for server to start....411196 2025-11-24 16:31:05.352 JST
> >>> LOG:
> >>> >> >> redirecting log output to logging collector process
> >>> >> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will
> >>> appear
> >>> >> in
> >>> >> >> directory "log".
> >>> >> >>  done
> >>> >> >> server started
> >>> >> >> waiting for server to start....411213 2025-11-24 16:31:05.461 JST
> >>> LOG:
> >>> >> >> redirecting log output to logging collector process
> >>> >> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will
> >>> appear
> >>> >> in
> >>> >> >> directory "log".
> >>> >> >>  done
> >>> >> >> server started
> >>> >> >> Waiting for command timeout...
> >>> >> >> fail: command timeout not detected
> >>> >> >>
> >>> >> >>
> >>> >>
> >>>
> ------------------------------------------------------------------------------
> >>> >> >>
> >>> >> >> Attached is the pgpool.log. If you need more info, please let me
> >>> know.
> >>> >> >>
> >>> >> >> Best regards,
> >>> >> >> --
> >>> >> >> Tatsuo Ishii
> >>> >> >> SRA OSS K.K.
> >>> >> >> English: http://www.sraoss.co.jp/index_en/
> >>> >> >> Japanese:http://www.sraoss.co.jp
> >>> >> >>
> >>> >> >>
> >>> >> >> > Hi Tatsuo,
> >>> >> >> >
> >>> >> >> > Sorry again, this was due to the separation of 2 patches and i
> only
> >>> >> sent
> >>> >> >> > the one.
> >>> >> >> >
> >>> >> >> > I've merged it into 1 commit and 1 patch and rebased over
> master to
> >>> >> avoid
> >>> >> >> > these issues moving forward.
> >>> >> >> >
> >>> >> >> > PFA latest version
> >>> >> >> >
> >>> >> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <
> [email protected]
> >>> >
> >>> >> >> wrote:
> >>> >> >> >
> >>> >> >> >> Hi Nadav,
> >>> >> >> >>
> >>> >> >> >> Thank you for new patch.
> >>> >> >> >> Unfortunately the patch did not apply to current master.
> >>> >> >> >>
> >>> >> >> >> $ git apply
> >>> >> >> >>
> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
> >>> >> >> >> error: patch failed:
> >>> >> src/streaming_replication/pool_worker_child.c:694
> >>> >> >> >> error: src/streaming_replication/pool_worker_child.c: patch
> does
> >>> not
> >>> >> >> apply
> >>> >> >> >>
> >>> >> >> >> Maybe the patch is on top of your previous patch?
> >>> >> >> >>
> >>> >> >> >> Also I suggest to use "-v" option of "git format-patch" to
> add the
> >>> >> >> >> patch version number so that we can easily know which patch
> is the
> >>> >> >> >> latest.
> >>> >> >> >>
> >>> >> >> >> Best regards,
> >>> >> >> >> --
> >>> >> >> >> Tatsuo Ishii
> >>> >> >> >> SRA OSS K.K.
> >>> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >>> >> >> >> Japanese:http://www.sraoss.co.jp
> >>> >> >> >>
> >>> >> >> >> > Hi Tatsuo,
> >>> >> >> >> >
> >>> >> >> >> > Please see attached an updated version.
> >>> >> >> >> >
> >>> >> >> >> > thank you
> >>> >> >> >> >
> >>> >> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <
> >>> [email protected]>
> >>> >> >> >> wrote:
> >>> >> >> >> >
> >>> >> >> >> >> > Sorry for that - thanks for the patch.
> >>> >> >> >> >> >
> >>> >> >> >> >> > Please find attached a new version
> >>> >> >> >> >>
> >>> >> >> >> >> Thanks for the new version. Unfortunately this time
> regression
> >>> >> test
> >>> >> >> >> >> fails at:
> >>> >> >> >> >>
> >>> >> >> >> >> > Waiting for command timeout...
> >>> >> >> >> >> > fail: command timeout not detected
> >>> >> >> >> >>
> >>> >> >> >> >> Attached is the pgpool.log.
> >>> >> >> >> >>
> >>> >> >> >> >> Best regards,
> >>> >> >> >> >> --
> >>> >> >> >> >> Tatsuo Ishii
> >>> >> >> >> >> SRA OSS K.K.
> >>> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >>> >> >> >> >> Japanese:http://www.sraoss.co.jp
> >>> >> >> >> >>
> >>> >> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
> >>> >> [email protected]>
> >>> >> >> >> >> wrote:
> >>> >> >> >> >> >
> >>> >> >> >> >> >> > thanks and sorry for the issues, please find attached
> >>> updated
> >>> >> >> >> version.
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> No problem.
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> This time the patch applies fine, no compiler warnings.
> >>> >> However,
> >>> >> >> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if
> >>> this
> >>> >> >> >> >> >> matters).  So I looked into
> >>> >> >> >> >> >>
> >>> >> src/test/regression/tests/041.external_replication_delay/test.sh a
> >>> >> >> >> >> >> little bit and apply attached patch (test.sh.patch). It
> >>> moved
> >>> >> >> forward
> >>> >> >> >> >> >> partially but failed at:
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> fail: command execution failure not detected
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> Please find attached
> >>> >> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >> >>
> >>> >> >>
> >>> >>
> >>>
> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
> >>> >> >> >> >> >> and
> src/test/regression/log/041.external_replication_delay.
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> Best regards,
> >>> >> >> >> >> >> --
> >>> >> >> >> >> >> Tatsuo Ishii
> >>> >> >> >> >> >> SRA OSS K.K.
> >>> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >>> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
> >>> >> >> >> >> >>
> >>> >> >> >> >> >
> >>> >> >> >> >> >
> >>> >> >> >> >> > --
> >>> >> >> >> >> > Nadav Shatz
> >>> >> >> >> >> > Tailor Brands | CTO
> >>> >> >> >> >>
> >>> >> >> >> >
> >>> >> >> >> >
> >>> >> >> >> > --
> >>> >> >> >> > Nadav Shatz
> >>> >> >> >> > Tailor Brands | CTO
> >>> >> >> >>
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > --
> >>> >> >> > Nadav Shatz
> >>> >> >> > Tailor Brands | CTO
> >>> >> >>
> >>> >> >
> >>> >> >
> >>> >> > --
> >>> >> > Nadav Shatz
> >>> >> > Tailor Brands | CTO
> >>> >>
> >>> >
> >>> >
> >>> > --
> >>> > Nadav Shatz
> >>> > Tailor Brands | CTO
> >>>
> >>
> >>
> >> --
> >> Nadav Shatz
> >> Tailor Brands | CTO
>


-- 
Nadav Shatz
Tailor Brands | CTO


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-28 23:48  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-12-28 23:48 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> thank you! works for me. should we merge both into master or do you want me
> to send a combined one?

> do you want me to send a combined one?

Yes, please send a combined one.

I will do more tests and detailed code review.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> On Fri, Dec 26, 2025 at 12:03 PM Tatsuo Ishii <[email protected]> wrote:
> 
>> Hi Nadav,
>>
>> I just want to make it clear. The patch should be applied on top of
>> your latest.patch.
>>
>> > (Please disregard previous mail. I seem to have mangled the message).
>> >
>> > I think I found a cause of the problem. On Linux, if SIGCHLD is
>> > ignored (set to SIG_IGN), waitpid() cannot get proper child status.
>> > Because the kernel relcaims the resource for the child process to not
>> > make the child process a zombie. And this makes waitpid() to fail with
>> > ECHLD. Since the return of waitpid() is not checked, I did not notice
>> > the waitpid() failure (I recommend to check the return value of
>> > waitpid()).
>> >
>> >       /* set up signal handlers */
>> >       signal(SIGALRM, SIG_DFL);
>> >       signal(SIGTERM, my_signal_handler);
>> >       signal(SIGINT, my_signal_handler);
>> >       signal(SIGHUP, reload_config_handler);
>> >       signal(SIGQUIT, my_signal_handler);
>> >       signal(SIGCHLD, SIG_IGN);       <--- SIGCHLD is ignored
>> >       signal(SIGUSR1, my_signal_handler);
>> >       signal(SIGUSR2, SIG_IGN);
>> >       signal(SIGPIPE, SIG_IGN);
>> >
>> > To fix this, either change the line above to:
>> >
>> >       signal(SIGCHLD, SIG_DFL);
>> > or
>> >       signal(SIGCHLD, my_signal_handler);
>> >       and modify my_signal_handler.
>> >
>> > I recommend the latter, because it does not depend on the default
>> > behavior of SIGCHLD, which might be different per platform.
>> > Attached is the patch to do this. (and run pgindent).
>> > I also notice that something like:
>> >
>> >               /* Count tokens in output for validation */
>> >               char *line_copy = pstrdup(line);
>> >               char *temp_token = strtok(line_copy, " \t\n");
>> >
>> > You should declare line_copy and temp_token in the begging of the code
>> > block (or in the outer block).  The forward declaration is recommended
>> > coding style in Pgpool-II (and PostgreSQL). Same thing can be said to
>> > some other variables.
>> >
>> > Best regards,
>> > --
>> > Tatsuo Ishii
>> > SRA OSS K.K.
>> > English: http://www.sraoss.co.jp/index_en/
>> > Japanese:http://www.sraoss.co.jp
>> >
>> >> Hi Tatsuo,
>> >>
>> >> Thank you for the note.
>> >>
>> >> I've removed the docker stuff. started working in an ubuntu 24 VM to
>> match
>> >> the setup. hopefully the results will be better, had so many issues
>> >> compiling and testing before that stuff wasn't properly formulated.
>> >>
>> >> Attaching the latest patch.
>> >>
>> >> this is what i'm seeing:
>> >> adav@lima-dev:/src/pgpool2/src/test/regression$ PGHOST=/tmp
>> ./regress.sh -p
>> >> /usr/bin 041.external_replication_delay
>> >> creating pgpool-II temporary installation ...
>> >> moving pgpool_setup to temporary installation path ...
>> >> moving watchdog_setup to temporary installation path ...
>> >> using pgpool-II at /src/pgpool2/src/test/regression/temp/installed
>> >> *************************
>> >> REGRESSION MODE          : install
>> >> Pgpool-II version        : pgpool-II version 4.8devel (mitsukakeboshi)
>> >> Pgpool-II install path   :
>> /src/pgpool2/src/test/regression/temp/installed
>> >> PostgreSQL bin           : /usr/lib/postgresql/16/bin
>> >> PostgreSQL Major version : 16
>> >> pgbench                  : /usr/lib/postgresql/16/bin/pgbench
>> >> PostgreSQL jdbc          :
>> >> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
>> >> *************************
>> >> testing 041.external_replication_delay...ok.
>> >> out of 1 ok:1 failed:0 timeout:0
>> >>
>> >>
>> >>
>> >> On Tue, Dec 23, 2025 at 10:46 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >>
>> >>> > Hi Tatsuo,
>> >>> >
>> >>> > I'km running into issues testing this and have created a full docker
>> >>> > compose setup - can you please point me to up to date guides on the
>> best
>> >>> > way to run the tests so i know we're doing it the same way?
>> >>> >
>> >>> > Thank you for all your help!
>> >>>
>> >>> I have run the regression test on the Pgpool-II master branch on my
>> >>> Ubuntu 24 box.
>> >>>
>> >>> cd pgpool2/src/test/regression
>> >>> ./regress.sh 041
>> >>>
>> >>> This time I noticed:
>> >>>
>> >>> - The patch does not named with version number
>> >>> - The patch creates .dockerignore and docker/ directory.
>> >>>
>> >>> Are they intended? I am asking because they are different from the
>> >>> previous version.
>> >>>
>> >>> > On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <[email protected]>
>> >>> wrote:
>> >>> >
>> >>> >> > I think everything is passing now. new version attached.
>> >>> >>
>> >>> >> Unfortunately Test1 did not pass.
>> >>> >>
>> >>> >> === Test1: Basic external command with integer millisecond values
>> ===
>> >>> >> waiting for server to start....1438600 2025-12-23 09:09:48.337 JST
>> LOG:
>> >>> >> redirecting log output to logging collector process
>> >>> >> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will
>> appear
>> >>> >> in directory "log".
>> >>> >>  done
>> >>> >> server started
>> >>> >> waiting for server to start....1438617 2025-12-23 09:09:48.443 JST
>> LOG:
>> >>> >> redirecting log output to logging collector process
>> >>> >> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will
>> appear
>> >>> >> in directory "log".
>> >>> >>  done
>> >>> >> server started
>> >>> >> waiting for server to start....1438634 2025-12-23 09:09:48.561 JST
>> LOG:
>> >>> >> redirecting log output to logging collector process
>> >>> >> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will
>> appear
>> >>> >> in directory "log".
>> >>> >>  done
>> >>> >> server started
>> >>> >> CREATE TABLE
>> >>> >> Waiting for sr_check to run...
>> >>> >> Command executed after 1 seconds
>> >>> >>  node_id | hostname  | port  | status | pg_status | lb_weight |
>> role
>> >>>  |
>> >>> >> pg_role | select_cnt | load_balance_node | replication_delay |
>> >>> >> replication_state | replication_sync_state | last_status_change
>> >>> >>
>> >>> >>
>> >>>
>> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>> >>> >>  0       | localhost | 11002 | up     | up        | 0.333333  |
>> primary
>> >>> |
>> >>> >> primary | 0          | true              | 0                 |
>> >>> >>      |                        | 2025-12-23 09:09:49
>> >>> >>  1       | localhost | 11003 | up     | up        | 0.333333  |
>> standby
>> >>> |
>> >>> >> standby | 0          | false             | 0                 |
>> >>> >>      |                        | 2025-12-23 09:09:49
>> >>> >>  2       | localhost | 11004 | up     | up        | 0.333333  |
>> standby
>> >>> |
>> >>> >> standby | 0          | false             | 0                 |
>> >>> >>      |                        | 2025-12-23 09:09:49
>> >>> >> (3 rows)
>> >>> >>
>> >>> >> fail: external command delay logging not found
>> >>> >>
>> >>> >> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <
>> [email protected]>
>> >>> >> wrote:
>> >>> >> >
>> >>> >> >> Thank you for updating the patch! This time the patch applies
>> without
>> >>> >> >> any issue and compiles fine. Unfortunately regression test
>> failed.
>> >>> >> >>
>> >>> >> >> testing 041.external_replication_delay...failed.
>> >>> >> >>
>> >>> >> >> From the regression log, it seems Test7 failed.
>> >>> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> ------------------------------------------------------------------------------
>> >>> >> >> === Test7: Command timeout handling ===
>> >>> >> >> waiting for server to start....411181 2025-11-24 16:31:05.244 JST
>> >>> LOG:
>> >>> >> >> redirecting log output to logging collector process
>> >>> >> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will
>> >>> appear
>> >>> >> in
>> >>> >> >> directory "log".
>> >>> >> >>  done
>> >>> >> >> server started
>> >>> >> >> waiting for server to start....411196 2025-11-24 16:31:05.352 JST
>> >>> LOG:
>> >>> >> >> redirecting log output to logging collector process
>> >>> >> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will
>> >>> appear
>> >>> >> in
>> >>> >> >> directory "log".
>> >>> >> >>  done
>> >>> >> >> server started
>> >>> >> >> waiting for server to start....411213 2025-11-24 16:31:05.461 JST
>> >>> LOG:
>> >>> >> >> redirecting log output to logging collector process
>> >>> >> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will
>> >>> appear
>> >>> >> in
>> >>> >> >> directory "log".
>> >>> >> >>  done
>> >>> >> >> server started
>> >>> >> >> Waiting for command timeout...
>> >>> >> >> fail: command timeout not detected
>> >>> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> ------------------------------------------------------------------------------
>> >>> >> >>
>> >>> >> >> Attached is the pgpool.log. If you need more info, please let me
>> >>> know.
>> >>> >> >>
>> >>> >> >> Best regards,
>> >>> >> >> --
>> >>> >> >> Tatsuo Ishii
>> >>> >> >> SRA OSS K.K.
>> >>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >>> >> >> Japanese:http://www.sraoss.co.jp
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> > Hi Tatsuo,
>> >>> >> >> >
>> >>> >> >> > Sorry again, this was due to the separation of 2 patches and i
>> only
>> >>> >> sent
>> >>> >> >> > the one.
>> >>> >> >> >
>> >>> >> >> > I've merged it into 1 commit and 1 patch and rebased over
>> master to
>> >>> >> avoid
>> >>> >> >> > these issues moving forward.
>> >>> >> >> >
>> >>> >> >> > PFA latest version
>> >>> >> >> >
>> >>> >> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <
>> [email protected]
>> >>> >
>> >>> >> >> wrote:
>> >>> >> >> >
>> >>> >> >> >> Hi Nadav,
>> >>> >> >> >>
>> >>> >> >> >> Thank you for new patch.
>> >>> >> >> >> Unfortunately the patch did not apply to current master.
>> >>> >> >> >>
>> >>> >> >> >> $ git apply
>> >>> >> >> >>
>> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
>> >>> >> >> >> error: patch failed:
>> >>> >> src/streaming_replication/pool_worker_child.c:694
>> >>> >> >> >> error: src/streaming_replication/pool_worker_child.c: patch
>> does
>> >>> not
>> >>> >> >> apply
>> >>> >> >> >>
>> >>> >> >> >> Maybe the patch is on top of your previous patch?
>> >>> >> >> >>
>> >>> >> >> >> Also I suggest to use "-v" option of "git format-patch" to
>> add the
>> >>> >> >> >> patch version number so that we can easily know which patch
>> is the
>> >>> >> >> >> latest.
>> >>> >> >> >>
>> >>> >> >> >> Best regards,
>> >>> >> >> >> --
>> >>> >> >> >> Tatsuo Ishii
>> >>> >> >> >> SRA OSS K.K.
>> >>> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >>> >> >> >> Japanese:http://www.sraoss.co.jp
>> >>> >> >> >>
>> >>> >> >> >> > Hi Tatsuo,
>> >>> >> >> >> >
>> >>> >> >> >> > Please see attached an updated version.
>> >>> >> >> >> >
>> >>> >> >> >> > thank you
>> >>> >> >> >> >
>> >>> >> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <
>> >>> [email protected]>
>> >>> >> >> >> wrote:
>> >>> >> >> >> >
>> >>> >> >> >> >> > Sorry for that - thanks for the patch.
>> >>> >> >> >> >> >
>> >>> >> >> >> >> > Please find attached a new version
>> >>> >> >> >> >>
>> >>> >> >> >> >> Thanks for the new version. Unfortunately this time
>> regression
>> >>> >> test
>> >>> >> >> >> >> fails at:
>> >>> >> >> >> >>
>> >>> >> >> >> >> > Waiting for command timeout...
>> >>> >> >> >> >> > fail: command timeout not detected
>> >>> >> >> >> >>
>> >>> >> >> >> >> Attached is the pgpool.log.
>> >>> >> >> >> >>
>> >>> >> >> >> >> Best regards,
>> >>> >> >> >> >> --
>> >>> >> >> >> >> Tatsuo Ishii
>> >>> >> >> >> >> SRA OSS K.K.
>> >>> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >>> >> >> >> >> Japanese:http://www.sraoss.co.jp
>> >>> >> >> >> >>
>> >>> >> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
>> >>> >> [email protected]>
>> >>> >> >> >> >> wrote:
>> >>> >> >> >> >> >
>> >>> >> >> >> >> >> > thanks and sorry for the issues, please find attached
>> >>> updated
>> >>> >> >> >> version.
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> No problem.
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> This time the patch applies fine, no compiler warnings.
>> >>> >> However,
>> >>> >> >> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if
>> >>> this
>> >>> >> >> >> >> >> matters).  So I looked into
>> >>> >> >> >> >> >>
>> >>> >> src/test/regression/tests/041.external_replication_delay/test.sh a
>> >>> >> >> >> >> >> little bit and apply attached patch (test.sh.patch). It
>> >>> moved
>> >>> >> >> forward
>> >>> >> >> >> >> >> partially but failed at:
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> fail: command execution failure not detected
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> Please find attached
>> >>> >> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>> >>> >> >> >> >> >> and
>> src/test/regression/log/041.external_replication_delay.
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> Best regards,
>> >>> >> >> >> >> >> --
>> >>> >> >> >> >> >> Tatsuo Ishii
>> >>> >> >> >> >> >> SRA OSS K.K.
>> >>> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >>> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >
>> >>> >> >> >> >> >
>> >>> >> >> >> >> > --
>> >>> >> >> >> >> > Nadav Shatz
>> >>> >> >> >> >> > Tailor Brands | CTO
>> >>> >> >> >> >>
>> >>> >> >> >> >
>> >>> >> >> >> >
>> >>> >> >> >> > --
>> >>> >> >> >> > Nadav Shatz
>> >>> >> >> >> > Tailor Brands | CTO
>> >>> >> >> >>
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> > --
>> >>> >> >> > Nadav Shatz
>> >>> >> >> > Tailor Brands | CTO
>> >>> >> >>
>> >>> >> >
>> >>> >> >
>> >>> >> > --
>> >>> >> > Nadav Shatz
>> >>> >> > Tailor Brands | CTO
>> >>> >>
>> >>> >
>> >>> >
>> >>> > --
>> >>> > Nadav Shatz
>> >>> > Tailor Brands | CTO
>> >>>
>> >>
>> >>
>> >> --
>> >> Nadav Shatz
>> >> Tailor Brands | CTO
>>
> 
> 
> -- 
> Nadav Shatz
> Tailor Brands | CTO
<


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-28 23:58  Tatsuo Ishii <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2025-12-28 23:58 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

>> thank you! works for me. should we merge both into master or do you want me
>> to send a combined one?
> 
>> do you want me to send a combined one?
> 
> Yes, please send a combined one.
> 
> I will do more tests and detailed code review.

Also when combing the patches, please correspond followings.

>>> > I also notice that something like:
>>> >
>>> >               /* Count tokens in output for validation */
>>> >               char *line_copy = pstrdup(line);
>>> >               char *temp_token = strtok(line_copy, " \t\n");
>>> >
>>> > You should declare line_copy and temp_token in the begging of the code
>>> > block (or in the outer block).  The forward declaration is recommended
>>> > coding style in Pgpool-II (and PostgreSQL). Same thing can be said to
>>> > some other variables.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

>> On Fri, Dec 26, 2025 at 12:03 PM Tatsuo Ishii <[email protected]> wrote:
>> 
>>> Hi Nadav,
>>>
>>> I just want to make it clear. The patch should be applied on top of
>>> your latest.patch.
>>>
>>> > (Please disregard previous mail. I seem to have mangled the message).
>>> >
>>> > I think I found a cause of the problem. On Linux, if SIGCHLD is
>>> > ignored (set to SIG_IGN), waitpid() cannot get proper child status.
>>> > Because the kernel relcaims the resource for the child process to not
>>> > make the child process a zombie. And this makes waitpid() to fail with
>>> > ECHLD. Since the return of waitpid() is not checked, I did not notice
>>> > the waitpid() failure (I recommend to check the return value of
>>> > waitpid()).
>>> >
>>> >       /* set up signal handlers */
>>> >       signal(SIGALRM, SIG_DFL);
>>> >       signal(SIGTERM, my_signal_handler);
>>> >       signal(SIGINT, my_signal_handler);
>>> >       signal(SIGHUP, reload_config_handler);
>>> >       signal(SIGQUIT, my_signal_handler);
>>> >       signal(SIGCHLD, SIG_IGN);       <--- SIGCHLD is ignored
>>> >       signal(SIGUSR1, my_signal_handler);
>>> >       signal(SIGUSR2, SIG_IGN);
>>> >       signal(SIGPIPE, SIG_IGN);
>>> >
>>> > To fix this, either change the line above to:
>>> >
>>> >       signal(SIGCHLD, SIG_DFL);
>>> > or
>>> >       signal(SIGCHLD, my_signal_handler);
>>> >       and modify my_signal_handler.
>>> >
>>> > I recommend the latter, because it does not depend on the default
>>> > behavior of SIGCHLD, which might be different per platform.
>>> > Attached is the patch to do this. (and run pgindent).
>>> > I also notice that something like:
>>> >
>>> >               /* Count tokens in output for validation */
>>> >               char *line_copy = pstrdup(line);
>>> >               char *temp_token = strtok(line_copy, " \t\n");
>>> >
>>> > You should declare line_copy and temp_token in the begging of the code
>>> > block (or in the outer block).  The forward declaration is recommended
>>> > coding style in Pgpool-II (and PostgreSQL). Same thing can be said to
>>> > some other variables.
>>> >
>>> > Best regards,
>>> > --
>>> > Tatsuo Ishii
>>> > SRA OSS K.K.
>>> > English: http://www.sraoss.co.jp/index_en/
>>> > Japanese:http://www.sraoss.co.jp
>>> >
>>> >> Hi Tatsuo,
>>> >>
>>> >> Thank you for the note.
>>> >>
>>> >> I've removed the docker stuff. started working in an ubuntu 24 VM to
>>> match
>>> >> the setup. hopefully the results will be better, had so many issues
>>> >> compiling and testing before that stuff wasn't properly formulated.
>>> >>
>>> >> Attaching the latest patch.
>>> >>
>>> >> this is what i'm seeing:
>>> >> adav@lima-dev:/src/pgpool2/src/test/regression$ PGHOST=/tmp
>>> ./regress.sh -p
>>> >> /usr/bin 041.external_replication_delay
>>> >> creating pgpool-II temporary installation ...
>>> >> moving pgpool_setup to temporary installation path ...
>>> >> moving watchdog_setup to temporary installation path ...
>>> >> using pgpool-II at /src/pgpool2/src/test/regression/temp/installed
>>> >> *************************
>>> >> REGRESSION MODE          : install
>>> >> Pgpool-II version        : pgpool-II version 4.8devel (mitsukakeboshi)
>>> >> Pgpool-II install path   :
>>> /src/pgpool2/src/test/regression/temp/installed
>>> >> PostgreSQL bin           : /usr/lib/postgresql/16/bin
>>> >> PostgreSQL Major version : 16
>>> >> pgbench                  : /usr/lib/postgresql/16/bin/pgbench
>>> >> PostgreSQL jdbc          :
>>> >> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
>>> >> *************************
>>> >> testing 041.external_replication_delay...ok.
>>> >> out of 1 ok:1 failed:0 timeout:0
>>> >>
>>> >>
>>> >>
>>> >> On Tue, Dec 23, 2025 at 10:46 AM Tatsuo Ishii <[email protected]>
>>> wrote:
>>> >>
>>> >>> > Hi Tatsuo,
>>> >>> >
>>> >>> > I'km running into issues testing this and have created a full docker
>>> >>> > compose setup - can you please point me to up to date guides on the
>>> best
>>> >>> > way to run the tests so i know we're doing it the same way?
>>> >>> >
>>> >>> > Thank you for all your help!
>>> >>>
>>> >>> I have run the regression test on the Pgpool-II master branch on my
>>> >>> Ubuntu 24 box.
>>> >>>
>>> >>> cd pgpool2/src/test/regression
>>> >>> ./regress.sh 041
>>> >>>
>>> >>> This time I noticed:
>>> >>>
>>> >>> - The patch does not named with version number
>>> >>> - The patch creates .dockerignore and docker/ directory.
>>> >>>
>>> >>> Are they intended? I am asking because they are different from the
>>> >>> previous version.
>>> >>>
>>> >>> > On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <[email protected]>
>>> >>> wrote:
>>> >>> >
>>> >>> >> > I think everything is passing now. new version attached.
>>> >>> >>
>>> >>> >> Unfortunately Test1 did not pass.
>>> >>> >>
>>> >>> >> === Test1: Basic external command with integer millisecond values
>>> ===
>>> >>> >> waiting for server to start....1438600 2025-12-23 09:09:48.337 JST
>>> LOG:
>>> >>> >> redirecting log output to logging collector process
>>> >>> >> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output will
>>> appear
>>> >>> >> in directory "log".
>>> >>> >>  done
>>> >>> >> server started
>>> >>> >> waiting for server to start....1438617 2025-12-23 09:09:48.443 JST
>>> LOG:
>>> >>> >> redirecting log output to logging collector process
>>> >>> >> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output will
>>> appear
>>> >>> >> in directory "log".
>>> >>> >>  done
>>> >>> >> server started
>>> >>> >> waiting for server to start....1438634 2025-12-23 09:09:48.561 JST
>>> LOG:
>>> >>> >> redirecting log output to logging collector process
>>> >>> >> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output will
>>> appear
>>> >>> >> in directory "log".
>>> >>> >>  done
>>> >>> >> server started
>>> >>> >> CREATE TABLE
>>> >>> >> Waiting for sr_check to run...
>>> >>> >> Command executed after 1 seconds
>>> >>> >>  node_id | hostname  | port  | status | pg_status | lb_weight |
>>> role
>>> >>>  |
>>> >>> >> pg_role | select_cnt | load_balance_node | replication_delay |
>>> >>> >> replication_state | replication_sync_state | last_status_change
>>> >>> >>
>>> >>> >>
>>> >>>
>>> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>>> >>> >>  0       | localhost | 11002 | up     | up        | 0.333333  |
>>> primary
>>> >>> |
>>> >>> >> primary | 0          | true              | 0                 |
>>> >>> >>      |                        | 2025-12-23 09:09:49
>>> >>> >>  1       | localhost | 11003 | up     | up        | 0.333333  |
>>> standby
>>> >>> |
>>> >>> >> standby | 0          | false             | 0                 |
>>> >>> >>      |                        | 2025-12-23 09:09:49
>>> >>> >>  2       | localhost | 11004 | up     | up        | 0.333333  |
>>> standby
>>> >>> |
>>> >>> >> standby | 0          | false             | 0                 |
>>> >>> >>      |                        | 2025-12-23 09:09:49
>>> >>> >> (3 rows)
>>> >>> >>
>>> >>> >> fail: external command delay logging not found
>>> >>> >>
>>> >>> >> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <
>>> [email protected]>
>>> >>> >> wrote:
>>> >>> >> >
>>> >>> >> >> Thank you for updating the patch! This time the patch applies
>>> without
>>> >>> >> >> any issue and compiles fine. Unfortunately regression test
>>> failed.
>>> >>> >> >>
>>> >>> >> >> testing 041.external_replication_delay...failed.
>>> >>> >> >>
>>> >>> >> >> From the regression log, it seems Test7 failed.
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >>
>>> >>>
>>> ------------------------------------------------------------------------------
>>> >>> >> >> === Test7: Command timeout handling ===
>>> >>> >> >> waiting for server to start....411181 2025-11-24 16:31:05.244 JST
>>> >>> LOG:
>>> >>> >> >> redirecting log output to logging collector process
>>> >>> >> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output will
>>> >>> appear
>>> >>> >> in
>>> >>> >> >> directory "log".
>>> >>> >> >>  done
>>> >>> >> >> server started
>>> >>> >> >> waiting for server to start....411196 2025-11-24 16:31:05.352 JST
>>> >>> LOG:
>>> >>> >> >> redirecting log output to logging collector process
>>> >>> >> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output will
>>> >>> appear
>>> >>> >> in
>>> >>> >> >> directory "log".
>>> >>> >> >>  done
>>> >>> >> >> server started
>>> >>> >> >> waiting for server to start....411213 2025-11-24 16:31:05.461 JST
>>> >>> LOG:
>>> >>> >> >> redirecting log output to logging collector process
>>> >>> >> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output will
>>> >>> appear
>>> >>> >> in
>>> >>> >> >> directory "log".
>>> >>> >> >>  done
>>> >>> >> >> server started
>>> >>> >> >> Waiting for command timeout...
>>> >>> >> >> fail: command timeout not detected
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >>
>>> >>>
>>> ------------------------------------------------------------------------------
>>> >>> >> >>
>>> >>> >> >> Attached is the pgpool.log. If you need more info, please let me
>>> >>> know.
>>> >>> >> >>
>>> >>> >> >> Best regards,
>>> >>> >> >> --
>>> >>> >> >> Tatsuo Ishii
>>> >>> >> >> SRA OSS K.K.
>>> >>> >> >> English: http://www.sraoss.co.jp/index_en/
>>> >>> >> >> Japanese:http://www.sraoss.co.jp
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >> > Hi Tatsuo,
>>> >>> >> >> >
>>> >>> >> >> > Sorry again, this was due to the separation of 2 patches and i
>>> only
>>> >>> >> sent
>>> >>> >> >> > the one.
>>> >>> >> >> >
>>> >>> >> >> > I've merged it into 1 commit and 1 patch and rebased over
>>> master to
>>> >>> >> avoid
>>> >>> >> >> > these issues moving forward.
>>> >>> >> >> >
>>> >>> >> >> > PFA latest version
>>> >>> >> >> >
>>> >>> >> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <
>>> [email protected]
>>> >>> >
>>> >>> >> >> wrote:
>>> >>> >> >> >
>>> >>> >> >> >> Hi Nadav,
>>> >>> >> >> >>
>>> >>> >> >> >> Thank you for new patch.
>>> >>> >> >> >> Unfortunately the patch did not apply to current master.
>>> >>> >> >> >>
>>> >>> >> >> >> $ git apply
>>> >>> >> >> >>
>>> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
>>> >>> >> >> >> error: patch failed:
>>> >>> >> src/streaming_replication/pool_worker_child.c:694
>>> >>> >> >> >> error: src/streaming_replication/pool_worker_child.c: patch
>>> does
>>> >>> not
>>> >>> >> >> apply
>>> >>> >> >> >>
>>> >>> >> >> >> Maybe the patch is on top of your previous patch?
>>> >>> >> >> >>
>>> >>> >> >> >> Also I suggest to use "-v" option of "git format-patch" to
>>> add the
>>> >>> >> >> >> patch version number so that we can easily know which patch
>>> is the
>>> >>> >> >> >> latest.
>>> >>> >> >> >>
>>> >>> >> >> >> Best regards,
>>> >>> >> >> >> --
>>> >>> >> >> >> Tatsuo Ishii
>>> >>> >> >> >> SRA OSS K.K.
>>> >>> >> >> >> English: http://www.sraoss.co.jp/index_en/
>>> >>> >> >> >> Japanese:http://www.sraoss.co.jp
>>> >>> >> >> >>
>>> >>> >> >> >> > Hi Tatsuo,
>>> >>> >> >> >> >
>>> >>> >> >> >> > Please see attached an updated version.
>>> >>> >> >> >> >
>>> >>> >> >> >> > thank you
>>> >>> >> >> >> >
>>> >>> >> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <
>>> >>> [email protected]>
>>> >>> >> >> >> wrote:
>>> >>> >> >> >> >
>>> >>> >> >> >> >> > Sorry for that - thanks for the patch.
>>> >>> >> >> >> >> >
>>> >>> >> >> >> >> > Please find attached a new version
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> Thanks for the new version. Unfortunately this time
>>> regression
>>> >>> >> test
>>> >>> >> >> >> >> fails at:
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> > Waiting for command timeout...
>>> >>> >> >> >> >> > fail: command timeout not detected
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> Attached is the pgpool.log.
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> Best regards,
>>> >>> >> >> >> >> --
>>> >>> >> >> >> >> Tatsuo Ishii
>>> >>> >> >> >> >> SRA OSS K.K.
>>> >>> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>>> >>> >> >> >> >> Japanese:http://www.sraoss.co.jp
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
>>> >>> >> [email protected]>
>>> >>> >> >> >> >> wrote:
>>> >>> >> >> >> >> >
>>> >>> >> >> >> >> >> > thanks and sorry for the issues, please find attached
>>> >>> updated
>>> >>> >> >> >> version.
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> No problem.
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> This time the patch applies fine, no compiler warnings.
>>> >>> >> However,
>>> >>> >> >> >> >> >> regression test did not passed here (on Ubuntu 24 LTS if
>>> >>> this
>>> >>> >> >> >> >> >> matters).  So I looked into
>>> >>> >> >> >> >> >>
>>> >>> >> src/test/regression/tests/041.external_replication_delay/test.sh a
>>> >>> >> >> >> >> >> little bit and apply attached patch (test.sh.patch). It
>>> >>> moved
>>> >>> >> >> forward
>>> >>> >> >> >> >> >> partially but failed at:
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> fail: command execution failure not detected
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> Please find attached
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >> >>
>>> >>> >> >>
>>> >>> >>
>>> >>>
>>> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
>>> >>> >> >> >> >> >> and
>>> src/test/regression/log/041.external_replication_delay.
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> Best regards,
>>> >>> >> >> >> >> >> --
>>> >>> >> >> >> >> >> Tatsuo Ishii
>>> >>> >> >> >> >> >> SRA OSS K.K.
>>> >>> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>>> >>> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >
>>> >>> >> >> >> >> >
>>> >>> >> >> >> >> > --
>>> >>> >> >> >> >> > Nadav Shatz
>>> >>> >> >> >> >> > Tailor Brands | CTO
>>> >>> >> >> >> >>
>>> >>> >> >> >> >
>>> >>> >> >> >> >
>>> >>> >> >> >> > --
>>> >>> >> >> >> > Nadav Shatz
>>> >>> >> >> >> > Tailor Brands | CTO
>>> >>> >> >> >>
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> > --
>>> >>> >> >> > Nadav Shatz
>>> >>> >> >> > Tailor Brands | CTO
>>> >>> >> >>
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > --
>>> >>> >> > Nadav Shatz
>>> >>> >> > Tailor Brands | CTO
>>> >>> >>
>>> >>> >
>>> >>> >
>>> >>> > --
>>> >>> > Nadav Shatz
>>> >>> > Tailor Brands | CTO
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> Nadav Shatz
>>> >> Tailor Brands | CTO
>>>
>> 
>> 
>> -- 
>> Nadav Shatz
>> Tailor Brands | CTO
> <


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2025-12-29 09:31  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2025-12-29 09:31 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Thanks for the help! please find attached the latest version with all
changes and test passing.

On Mon, Dec 29, 2025 at 1:58 AM Tatsuo Ishii <[email protected]> wrote:

> >> thank you! works for me. should we merge both into master or do you
> want me
> >> to send a combined one?
> >
> >> do you want me to send a combined one?
> >
> > Yes, please send a combined one.
> >
> > I will do more tests and detailed code review.
>
> Also when combing the patches, please correspond followings.
>
> >>> > I also notice that something like:
> >>> >
> >>> >               /* Count tokens in output for validation */
> >>> >               char *line_copy = pstrdup(line);
> >>> >               char *temp_token = strtok(line_copy, " \t\n");
> >>> >
> >>> > You should declare line_copy and temp_token in the begging of the
> code
> >>> > block (or in the outer block).  The forward declaration is
> recommended
> >>> > coding style in Pgpool-II (and PostgreSQL). Same thing can be said to
> >>> > some other variables.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> >> On Fri, Dec 26, 2025 at 12:03 PM Tatsuo Ishii <[email protected]>
> wrote:
> >>
> >>> Hi Nadav,
> >>>
> >>> I just want to make it clear. The patch should be applied on top of
> >>> your latest.patch.
> >>>
> >>> > (Please disregard previous mail. I seem to have mangled the message).
> >>> >
> >>> > I think I found a cause of the problem. On Linux, if SIGCHLD is
> >>> > ignored (set to SIG_IGN), waitpid() cannot get proper child status.
> >>> > Because the kernel relcaims the resource for the child process to not
> >>> > make the child process a zombie. And this makes waitpid() to fail
> with
> >>> > ECHLD. Since the return of waitpid() is not checked, I did not notice
> >>> > the waitpid() failure (I recommend to check the return value of
> >>> > waitpid()).
> >>> >
> >>> >       /* set up signal handlers */
> >>> >       signal(SIGALRM, SIG_DFL);
> >>> >       signal(SIGTERM, my_signal_handler);
> >>> >       signal(SIGINT, my_signal_handler);
> >>> >       signal(SIGHUP, reload_config_handler);
> >>> >       signal(SIGQUIT, my_signal_handler);
> >>> >       signal(SIGCHLD, SIG_IGN);       <--- SIGCHLD is ignored
> >>> >       signal(SIGUSR1, my_signal_handler);
> >>> >       signal(SIGUSR2, SIG_IGN);
> >>> >       signal(SIGPIPE, SIG_IGN);
> >>> >
> >>> > To fix this, either change the line above to:
> >>> >
> >>> >       signal(SIGCHLD, SIG_DFL);
> >>> > or
> >>> >       signal(SIGCHLD, my_signal_handler);
> >>> >       and modify my_signal_handler.
> >>> >
> >>> > I recommend the latter, because it does not depend on the default
> >>> > behavior of SIGCHLD, which might be different per platform.
> >>> > Attached is the patch to do this. (and run pgindent).
> >>> > I also notice that something like:
> >>> >
> >>> >               /* Count tokens in output for validation */
> >>> >               char *line_copy = pstrdup(line);
> >>> >               char *temp_token = strtok(line_copy, " \t\n");
> >>> >
> >>> > You should declare line_copy and temp_token in the begging of the
> code
> >>> > block (or in the outer block).  The forward declaration is
> recommended
> >>> > coding style in Pgpool-II (and PostgreSQL). Same thing can be said to
> >>> > some other variables.
> >>> >
> >>> > Best regards,
> >>> > --
> >>> > Tatsuo Ishii
> >>> > SRA OSS K.K.
> >>> > English: http://www.sraoss.co.jp/index_en/
> >>> > Japanese:http://www.sraoss.co.jp
> >>> >
> >>> >> Hi Tatsuo,
> >>> >>
> >>> >> Thank you for the note.
> >>> >>
> >>> >> I've removed the docker stuff. started working in an ubuntu 24 VM to
> >>> match
> >>> >> the setup. hopefully the results will be better, had so many issues
> >>> >> compiling and testing before that stuff wasn't properly formulated.
> >>> >>
> >>> >> Attaching the latest patch.
> >>> >>
> >>> >> this is what i'm seeing:
> >>> >> adav@lima-dev:/src/pgpool2/src/test/regression$ PGHOST=/tmp
> >>> ./regress.sh -p
> >>> >> /usr/bin 041.external_replication_delay
> >>> >> creating pgpool-II temporary installation ...
> >>> >> moving pgpool_setup to temporary installation path ...
> >>> >> moving watchdog_setup to temporary installation path ...
> >>> >> using pgpool-II at /src/pgpool2/src/test/regression/temp/installed
> >>> >> *************************
> >>> >> REGRESSION MODE          : install
> >>> >> Pgpool-II version        : pgpool-II version 4.8devel
> (mitsukakeboshi)
> >>> >> Pgpool-II install path   :
> >>> /src/pgpool2/src/test/regression/temp/installed
> >>> >> PostgreSQL bin           : /usr/lib/postgresql/16/bin
> >>> >> PostgreSQL Major version : 16
> >>> >> pgbench                  : /usr/lib/postgresql/16/bin/pgbench
> >>> >> PostgreSQL jdbc          :
> >>> >> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
> >>> >> *************************
> >>> >> testing 041.external_replication_delay...ok.
> >>> >> out of 1 ok:1 failed:0 timeout:0
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Tue, Dec 23, 2025 at 10:46 AM Tatsuo Ishii <[email protected]
> >
> >>> wrote:
> >>> >>
> >>> >>> > Hi Tatsuo,
> >>> >>> >
> >>> >>> > I'km running into issues testing this and have created a full
> docker
> >>> >>> > compose setup - can you please point me to up to date guides on
> the
> >>> best
> >>> >>> > way to run the tests so i know we're doing it the same way?
> >>> >>> >
> >>> >>> > Thank you for all your help!
> >>> >>>
> >>> >>> I have run the regression test on the Pgpool-II master branch on my
> >>> >>> Ubuntu 24 box.
> >>> >>>
> >>> >>> cd pgpool2/src/test/regression
> >>> >>> ./regress.sh 041
> >>> >>>
> >>> >>> This time I noticed:
> >>> >>>
> >>> >>> - The patch does not named with version number
> >>> >>> - The patch creates .dockerignore and docker/ directory.
> >>> >>>
> >>> >>> Are they intended? I am asking because they are different from the
> >>> >>> previous version.
> >>> >>>
> >>> >>> > On Tue, Dec 23, 2025 at 2:13 AM Tatsuo Ishii <
> [email protected]>
> >>> >>> wrote:
> >>> >>> >
> >>> >>> >> > I think everything is passing now. new version attached.
> >>> >>> >>
> >>> >>> >> Unfortunately Test1 did not pass.
> >>> >>> >>
> >>> >>> >> === Test1: Basic external command with integer millisecond
> values
> >>> ===
> >>> >>> >> waiting for server to start....1438600 2025-12-23 09:09:48.337
> JST
> >>> LOG:
> >>> >>> >> redirecting log output to logging collector process
> >>> >>> >> 1438600 2025-12-23 09:09:48.337 JST HINT:  Future log output
> will
> >>> appear
> >>> >>> >> in directory "log".
> >>> >>> >>  done
> >>> >>> >> server started
> >>> >>> >> waiting for server to start....1438617 2025-12-23 09:09:48.443
> JST
> >>> LOG:
> >>> >>> >> redirecting log output to logging collector process
> >>> >>> >> 1438617 2025-12-23 09:09:48.443 JST HINT:  Future log output
> will
> >>> appear
> >>> >>> >> in directory "log".
> >>> >>> >>  done
> >>> >>> >> server started
> >>> >>> >> waiting for server to start....1438634 2025-12-23 09:09:48.561
> JST
> >>> LOG:
> >>> >>> >> redirecting log output to logging collector process
> >>> >>> >> 1438634 2025-12-23 09:09:48.561 JST HINT:  Future log output
> will
> >>> appear
> >>> >>> >> in directory "log".
> >>> >>> >>  done
> >>> >>> >> server started
> >>> >>> >> CREATE TABLE
> >>> >>> >> Waiting for sr_check to run...
> >>> >>> >> Command executed after 1 seconds
> >>> >>> >>  node_id | hostname  | port  | status | pg_status | lb_weight |
> >>> role
> >>> >>>  |
> >>> >>> >> pg_role | select_cnt | load_balance_node | replication_delay |
> >>> >>> >> replication_state | replication_sync_state | last_status_change
> >>> >>> >>
> >>> >>> >>
> >>> >>>
> >>>
> ---------+-----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
> >>> >>> >>  0       | localhost | 11002 | up     | up        | 0.333333  |
> >>> primary
> >>> >>> |
> >>> >>> >> primary | 0          | true              | 0                 |
> >>> >>> >>      |                        | 2025-12-23 09:09:49
> >>> >>> >>  1       | localhost | 11003 | up     | up        | 0.333333  |
> >>> standby
> >>> >>> |
> >>> >>> >> standby | 0          | false             | 0                 |
> >>> >>> >>      |                        | 2025-12-23 09:09:49
> >>> >>> >>  2       | localhost | 11004 | up     | up        | 0.333333  |
> >>> standby
> >>> >>> |
> >>> >>> >> standby | 0          | false             | 0                 |
> >>> >>> >>      |                        | 2025-12-23 09:09:49
> >>> >>> >> (3 rows)
> >>> >>> >>
> >>> >>> >> fail: external command delay logging not found
> >>> >>> >>
> >>> >>> >> > On Mon, Nov 24, 2025 at 9:41 AM Tatsuo Ishii <
> >>> [email protected]>
> >>> >>> >> wrote:
> >>> >>> >> >
> >>> >>> >> >> Thank you for updating the patch! This time the patch applies
> >>> without
> >>> >>> >> >> any issue and compiles fine. Unfortunately regression test
> >>> failed.
> >>> >>> >> >>
> >>> >>> >> >> testing 041.external_replication_delay...failed.
> >>> >>> >> >>
> >>> >>> >> >> From the regression log, it seems Test7 failed.
> >>> >>> >> >>
> >>> >>> >> >>
> >>> >>> >>
> >>> >>>
> >>>
> ------------------------------------------------------------------------------
> >>> >>> >> >> === Test7: Command timeout handling ===
> >>> >>> >> >> waiting for server to start....411181 2025-11-24
> 16:31:05.244 JST
> >>> >>> LOG:
> >>> >>> >> >> redirecting log output to logging collector process
> >>> >>> >> >> 411181 2025-11-24 16:31:05.244 JST HINT:  Future log output
> will
> >>> >>> appear
> >>> >>> >> in
> >>> >>> >> >> directory "log".
> >>> >>> >> >>  done
> >>> >>> >> >> server started
> >>> >>> >> >> waiting for server to start....411196 2025-11-24
> 16:31:05.352 JST
> >>> >>> LOG:
> >>> >>> >> >> redirecting log output to logging collector process
> >>> >>> >> >> 411196 2025-11-24 16:31:05.352 JST HINT:  Future log output
> will
> >>> >>> appear
> >>> >>> >> in
> >>> >>> >> >> directory "log".
> >>> >>> >> >>  done
> >>> >>> >> >> server started
> >>> >>> >> >> waiting for server to start....411213 2025-11-24
> 16:31:05.461 JST
> >>> >>> LOG:
> >>> >>> >> >> redirecting log output to logging collector process
> >>> >>> >> >> 411213 2025-11-24 16:31:05.461 JST HINT:  Future log output
> will
> >>> >>> appear
> >>> >>> >> in
> >>> >>> >> >> directory "log".
> >>> >>> >> >>  done
> >>> >>> >> >> server started
> >>> >>> >> >> Waiting for command timeout...
> >>> >>> >> >> fail: command timeout not detected
> >>> >>> >> >>
> >>> >>> >> >>
> >>> >>> >>
> >>> >>>
> >>>
> ------------------------------------------------------------------------------
> >>> >>> >> >>
> >>> >>> >> >> Attached is the pgpool.log. If you need more info, please
> let me
> >>> >>> know.
> >>> >>> >> >>
> >>> >>> >> >> Best regards,
> >>> >>> >> >> --
> >>> >>> >> >> Tatsuo Ishii
> >>> >>> >> >> SRA OSS K.K.
> >>> >>> >> >> English: http://www.sraoss.co.jp/index_en/
> >>> >>> >> >> Japanese:http://www.sraoss.co.jp
> >>> >>> >> >>
> >>> >>> >> >>
> >>> >>> >> >> > Hi Tatsuo,
> >>> >>> >> >> >
> >>> >>> >> >> > Sorry again, this was due to the separation of 2 patches
> and i
> >>> only
> >>> >>> >> sent
> >>> >>> >> >> > the one.
> >>> >>> >> >> >
> >>> >>> >> >> > I've merged it into 1 commit and 1 patch and rebased over
> >>> master to
> >>> >>> >> avoid
> >>> >>> >> >> > these issues moving forward.
> >>> >>> >> >> >
> >>> >>> >> >> > PFA latest version
> >>> >>> >> >> >
> >>> >>> >> >> > On Thu, Nov 20, 2025 at 1:09 AM Tatsuo Ishii <
> >>> [email protected]
> >>> >>> >
> >>> >>> >> >> wrote:
> >>> >>> >> >> >
> >>> >>> >> >> >> Hi Nadav,
> >>> >>> >> >> >>
> >>> >>> >> >> >> Thank you for new patch.
> >>> >>> >> >> >> Unfortunately the patch did not apply to current master.
> >>> >>> >> >> >>
> >>> >>> >> >> >> $ git apply
> >>> >>> >> >> >>
> >>> ~/0001-Fix-multiple-issues-in-external-replication-delay-fe.patch
> >>> >>> >> >> >> error: patch failed:
> >>> >>> >> src/streaming_replication/pool_worker_child.c:694
> >>> >>> >> >> >> error: src/streaming_replication/pool_worker_child.c:
> patch
> >>> does
> >>> >>> not
> >>> >>> >> >> apply
> >>> >>> >> >> >>
> >>> >>> >> >> >> Maybe the patch is on top of your previous patch?
> >>> >>> >> >> >>
> >>> >>> >> >> >> Also I suggest to use "-v" option of "git format-patch" to
> >>> add the
> >>> >>> >> >> >> patch version number so that we can easily know which
> patch
> >>> is the
> >>> >>> >> >> >> latest.
> >>> >>> >> >> >>
> >>> >>> >> >> >> Best regards,
> >>> >>> >> >> >> --
> >>> >>> >> >> >> Tatsuo Ishii
> >>> >>> >> >> >> SRA OSS K.K.
> >>> >>> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >>> >>> >> >> >> Japanese:http://www.sraoss.co.jp
> >>> >>> >> >> >>
> >>> >>> >> >> >> > Hi Tatsuo,
> >>> >>> >> >> >> >
> >>> >>> >> >> >> > Please see attached an updated version.
> >>> >>> >> >> >> >
> >>> >>> >> >> >> > thank you
> >>> >>> >> >> >> >
> >>> >>> >> >> >> > On Fri, Nov 7, 2025 at 2:07 AM Tatsuo Ishii <
> >>> >>> [email protected]>
> >>> >>> >> >> >> wrote:
> >>> >>> >> >> >> >
> >>> >>> >> >> >> >> > Sorry for that - thanks for the patch.
> >>> >>> >> >> >> >> >
> >>> >>> >> >> >> >> > Please find attached a new version
> >>> >>> >> >> >> >>
> >>> >>> >> >> >> >> Thanks for the new version. Unfortunately this time
> >>> regression
> >>> >>> >> test
> >>> >>> >> >> >> >> fails at:
> >>> >>> >> >> >> >>
> >>> >>> >> >> >> >> > Waiting for command timeout...
> >>> >>> >> >> >> >> > fail: command timeout not detected
> >>> >>> >> >> >> >>
> >>> >>> >> >> >> >> Attached is the pgpool.log.
> >>> >>> >> >> >> >>
> >>> >>> >> >> >> >> Best regards,
> >>> >>> >> >> >> >> --
> >>> >>> >> >> >> >> Tatsuo Ishii
> >>> >>> >> >> >> >> SRA OSS K.K.
> >>> >>> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >>> >>> >> >> >> >> Japanese:http://www.sraoss.co.jp
> >>> >>> >> >> >> >>
> >>> >>> >> >> >> >> > On Mon, Nov 3, 2025 at 9:05 AM Tatsuo Ishii <
> >>> >>> >> [email protected]>
> >>> >>> >> >> >> >> wrote:
> >>> >>> >> >> >> >> >
> >>> >>> >> >> >> >> >> > thanks and sorry for the issues, please find
> attached
> >>> >>> updated
> >>> >>> >> >> >> version.
> >>> >>> >> >> >> >> >>
> >>> >>> >> >> >> >> >> No problem.
> >>> >>> >> >> >> >> >>
> >>> >>> >> >> >> >> >> This time the patch applies fine, no compiler
> warnings.
> >>> >>> >> However,
> >>> >>> >> >> >> >> >> regression test did not passed here (on Ubuntu 24
> LTS if
> >>> >>> this
> >>> >>> >> >> >> >> >> matters).  So I looked into
> >>> >>> >> >> >> >> >>
> >>> >>> >>
> src/test/regression/tests/041.external_replication_delay/test.sh a
> >>> >>> >> >> >> >> >> little bit and apply attached patch
> (test.sh.patch). It
> >>> >>> moved
> >>> >>> >> >> forward
> >>> >>> >> >> >> >> >> partially but failed at:
> >>> >>> >> >> >> >> >>
> >>> >>> >> >> >> >> >> fail: command execution failure not detected
> >>> >>> >> >> >> >> >>
> >>> >>> >> >> >> >> >> Please find attached
> >>> >>> >> >> >> >> >>
> >>> >>> >> >> >> >>
> >>> >>> >> >> >>
> >>> >>> >> >>
> >>> >>> >>
> >>> >>>
> >>>
> src/test/regression/tests/041.external_replication_delay/testdir/pgpool.log
> >>> >>> >> >> >> >> >> and
> >>> src/test/regression/log/041.external_replication_delay.
> >>> >>> >> >> >> >> >>
> >>> >>> >> >> >> >> >> Best regards,
> >>> >>> >> >> >> >> >> --
> >>> >>> >> >> >> >> >> Tatsuo Ishii
> >>> >>> >> >> >> >> >> SRA OSS K.K.
> >>> >>> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >>> >>> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
> >>> >>> >> >> >> >> >>
> >>> >>> >> >> >> >> >
> >>> >>> >> >> >> >> >
> >>> >>> >> >> >> >> > --
> >>> >>> >> >> >> >> > Nadav Shatz
> >>> >>> >> >> >> >> > Tailor Brands | CTO
> >>> >>> >> >> >> >>
> >>> >>> >> >> >> >
> >>> >>> >> >> >> >
> >>> >>> >> >> >> > --
> >>> >>> >> >> >> > Nadav Shatz
> >>> >>> >> >> >> > Tailor Brands | CTO
> >>> >>> >> >> >>
> >>> >>> >> >> >
> >>> >>> >> >> >
> >>> >>> >> >> > --
> >>> >>> >> >> > Nadav Shatz
> >>> >>> >> >> > Tailor Brands | CTO
> >>> >>> >> >>
> >>> >>> >> >
> >>> >>> >> >
> >>> >>> >> > --
> >>> >>> >> > Nadav Shatz
> >>> >>> >> > Tailor Brands | CTO
> >>> >>> >>
> >>> >>> >
> >>> >>> >
> >>> >>> > --
> >>> >>> > Nadav Shatz
> >>> >>> > Tailor Brands | CTO
> >>> >>>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Nadav Shatz
> >>> >> Tailor Brands | CTO
> >>>
> >>
> >>
> >> --
> >> Nadav Shatz
> >> Tailor Brands | CTO
> > <
>


-- 
Nadav Shatz
Tailor Brands | CTO


Attachments:

  [application/x-patch] latest.patch (52.0K, 3-latest.patch)
  download | inline diff:
From abb21a87eae41070ade61a245b91c5683808ad0a Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 23 Dec 2025 13:39:04 +0200
Subject: [PATCH]    feat: external replication delay injection via external
 command

    Add support for obtaining replication delay from an external command
    instead of querying pg_stat_replication directly. This allows for
    more flexible monitoring setups where replication delay information
    may come from external monitoring systems.

    New configuration parameters:
    - replication_delay_source_cmd: Path to external command that provides
      delay values. When set, pgpool calls this command instead of querying
      PostgreSQL directly.
    - replication_delay_source_timeout: Timeout in seconds for the external
      command (default: 10).

    The external command receives replica identifiers as arguments in
    "host:port" format and should output delay values in milliseconds,
    one per line, corresponding to each replica argument.

    Includes regression test (041.external_replication_delay) covering:
    - Argument format validation
    - Integer and floating-point delay parsing
    - Error handling for malformed output and timeouts

diff --git a/doc/src/sgml/stream-check.sgml b/doc/src/sgml/stream-check.sgml
index d2ca3ca49c62dd481fb8e18616b12ab521521b1f..fc479908072f6afc63923ac699be0f63e15bc90a 100644
--- a/doc/src/sgml/stream-check.sgml
+++ b/doc/src/sgml/stream-check.sgml
@@ -309,6 +309,74 @@ GRANT pg_monitor TO sr_check_user;
     </listitem>
   </varlistentry>
 
+  <varlistentry id="guc-replication-delay-source-cmd" xreflabel="replication_delay_source_cmd">
+   <term><varname>replication_delay_source_cmd</varname> (<type>string</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_cmd</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies an external command to retrieve replication delay information for replica nodes.
+     When this parameter is set and not empty, <productname>Pgpool-II</productname> uses the
+     external command instead of built-in database queries to obtain replication delays.
+     The command is executed as the <productname>Pgpool-II</productname> process user.
+    </para>
+    <para>
+     The command receives replica node identifiers as positional arguments, with the primary
+     node omitted. Each identifier is in the format <literal>&lt;hostname&gt;:&lt;port&gt;</literal>,
+     for example <literal>server1:5432 server2:5432</literal>. The order matches
+     <productname>Pgpool-II</productname>'s backend order (excluding the primary), allowing the
+     script to correlate external metrics (such as from AWS CloudWatch for Aurora) to the correct nodes.
+    </para>
+    <para>
+     The command must write a single line to stdout containing one whitespace-separated delay value
+     per replica, in milliseconds, in the same order as the arguments. The primary node's delay is
+     implicitly zero and should not be included in the output. Delay values can be integers or
+     floating-point numbers.
+    </para>
+    <para>
+     Special value: <literal>-1</literal> indicates a replica that is down but not yet detected
+     by <productname>Pgpool-II</productname>'s health checks. <productname>Pgpool-II</productname>
+     will log this condition but rely on its own health-check logic to decide whether to trigger
+     failover; no failover is triggered solely by receiving <literal>-1</literal>.
+    </para>
+    <para>
+     Example for a 3-node cluster (1 primary + 2 replicas): if the command receives arguments
+     <literal>server1:5432 server2:5432</literal>, it should output <literal>"25.5 100"</literal>
+     to indicate the first replica has 25.5ms delay and the second has 100ms delay.
+    </para>
+    <para>
+     Default is empty (use built-in replication delay queries).
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry id="guc-replication-delay-source-timeout" xreflabel="replication_delay_source_timeout">
+   <term><varname>replication_delay_source_timeout</varname> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>replication_delay_source_timeout</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Specifies the timeout in seconds for the external command specified by
+     <xref linkend="guc-replication-delay-source-cmd">.
+     If the command does not finish within the timeout, <productname>Pgpool-II</productname>
+     logs an error and continues using the built-in method.
+    </para>
+    <para>
+     Default is 10 seconds. Valid range is 1-3600 seconds.
+    </para>
+    <para>
+     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+    </para>
+   </listitem>
+  </varlistentry>
+
   <varlistentry id="guc-log-standby-delay" xreflabel="log_standby_delay">
    <term><varname>log_standby_delay</varname> (<type>enum</type>)
     <indexterm>
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 0a0e483149190e14ca13406c08d0ee2ac0a9c53a..7c6d1803117541aaba50d9f9ff62e41e145c5d95 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -980,6 +980,16 @@ static struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_cmd", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"External command to retrieve replication delay information.",
+			CONFIG_VAR_TYPE_STRING, false, 0
+		},
+		&g_pool_config.replication_delay_source_cmd,
+		"",
+		NULL, NULL, NULL, NULL
+	},
+
 	{
 		{"failback_command", CFGCXT_RELOAD, FAILOVER_CONFIG,
 			"Command to execute when backend node is attached.",
@@ -2334,6 +2344,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"replication_delay_source_timeout", CFGCXT_RELOAD, STREAMING_REPLICATION_CONFIG,
+			"Timeout for external replication delay command execution in seconds.",
+			CONFIG_VAR_TYPE_INT, false, 0
+		},
+		&g_pool_config.replication_delay_source_timeout,
+		10,
+		1, 3600,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	EMPTY_CONFIG_INT
 };
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 758d515525c93c1c3f2686da049b294a286a574a..6f5f88fb200c2ea82cccddd8f02c35c8d0ade8f4 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -86,7 +86,6 @@ typedef enum LogStandbyDelayModes
 	LSD_NONE
 } LogStandbyDelayModes;
 
-
 typedef enum MemCacheMethod
 {
 	SHMEM_CACHE = 1,
@@ -364,6 +363,8 @@ typedef struct
 	char	   *sr_check_password;	/* password for sr_check_user */
 	char	   *sr_check_database;	/* PostgreSQL database name for streaming
 									 * replication check */
+	char	   *replication_delay_source_cmd;	/* external command for replication delay */
+	int			replication_delay_source_timeout;	/* timeout for external command in seconds */
 	char	   *failover_command;	/* execute command when failover happens */
 	char	   *follow_primary_command; /* execute command when failover is
 										 * ended */
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 797906491cb996d24c59b3710f462f1405737248..454fdb9e5d1fd65437b6a67f12ab62658ea08f49 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -519,6 +519,20 @@ backend_clustering_mode = streaming_replication
 
 #sr_check_database = 'postgres'
                                    # Database name for streaming replication check
+
+#replication_delay_source_cmd = ''
+                                   # External command to retrieve replication delay information
+                                   # If set, pgpool uses this command instead of built-in queries
+                                   # Command receives replica node identifiers (host:port) as arguments
+                                   # Primary node is omitted from arguments
+                                   # Command should output one delay value (in ms) per replica
+                                   # Use -1 to indicate a replica that is down but not yet detected
+                                   # Format: "25 100" for 2 replicas (e.g., 3-node cluster with 1 primary)
+                                   # Command runs as the pgpool process user
+#replication_delay_source_timeout = 10
+                                   # Timeout for external command execution in seconds
+                                   # Range: 1-3600 seconds (default: 10)
+
 #delay_threshold = 0
                                    # Threshold before not dispatching query to standby node
                                    # Unit is in bytes
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 5bf19c37d0cf1033c624f34ab3737f18871bc2f5..6c53fb8d59c00891456384a404b6ecab8c5451c5 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -43,6 +43,7 @@
 #include <unistd.h>
 #include <stdlib.h>
 #include <sys/time.h>
+#include <sys/wait.h>
 
 #ifdef HAVE_CRYPT_H
 #include <crypt.h>
@@ -76,6 +77,8 @@ static volatile sig_atomic_t restart_request = 0;
 static void establish_persistent_connection(void);
 static void discard_persistent_connection(void);
 static void check_replication_time_lag(void);
+static void check_replication_time_lag_with_cmd(void);
+static char *build_instance_identifier_for_node(int node_id);
 static void CheckReplicationTimeLagErrorCb(void *arg);
 static unsigned long long int text_to_lsn(char *text);
 static RETSIGTYPE my_signal_handler(int sig);
@@ -129,7 +132,7 @@ do_worker_child(void *params)
 	signal(SIGINT, my_signal_handler);
 	signal(SIGHUP, reload_config_handler);
 	signal(SIGQUIT, my_signal_handler);
-	signal(SIGCHLD, SIG_IGN);
+	signal(SIGCHLD, my_signal_handler);
 	signal(SIGUSR1, my_signal_handler);
 	signal(SIGUSR2, SIG_IGN);
 	signal(SIGPIPE, SIG_IGN);
@@ -260,7 +263,16 @@ do_worker_child(void *params)
 					int			i;
 
 					/* Do replication time lag checking */
-					check_replication_time_lag();
+
+					/*
+					 * Use external command if replication_delay_source_cmd is
+					 * configured
+					 */
+					if (pool_config->replication_delay_source_cmd &&
+						strlen(pool_config->replication_delay_source_cmd) > 0)
+						check_replication_time_lag_with_cmd();
+					else
+						check_replication_time_lag();
 
 					/* Check node status */
 					node_status = verify_backend_node_status(slots);
@@ -659,6 +671,446 @@ check_replication_time_lag(void)
 	error_context_stack = callback.previous;
 }
 
+#define MAX_CMD_OUTPUT 4096
+#define MAX_REASONABLE_DELAY_MS 3600000.0	/* 1 hour in milliseconds */
+
+/*
+ * Check replication time lag using external command
+ *
+ * The external command receives only replica (standby) node identifiers as arguments,
+ * omitting the primary node. It returns delay values in milliseconds for each replica.
+ * A value of -1 indicates a node that is down but not yet detected by pgpool's health checks.
+ */
+static void
+check_replication_time_lag_with_cmd(void)
+{
+	char	   *command = NULL;
+	char	   *line;
+	char	   *token;
+	char	   *saveptr;
+	char	   *line_copy;
+	char	   *temp_token;
+	char	   *endptr;
+	char	   *ident;
+	const char *base_command;
+	double		delay_ms;
+	uint64		delay;
+	uint64		delay_threshold_by_time;
+	int			token_count = 0;
+	int			primary_node_id;
+	int			save_errno;
+	int			i;
+	size_t		total_len;
+	size_t		current_len;
+	BackendInfo *bkinfo;
+	ErrorContextCallback callback;
+	int			pipefd[2] = {-1, -1};
+	pid_t		pid = -1;
+	int			ret;
+	struct timeval timeout;
+	fd_set		readfds;
+	ssize_t		bytes_read;
+	int			status;
+	int			num_replicas;
+
+	if (NUM_BACKENDS <= 1)
+	{
+		/* If there's only one node, there's no point to do checking */
+		return;
+	}
+
+	if (REAL_PRIMARY_NODE_ID < 0)
+	{
+		/* No need to check if there's no primary */
+		return;
+	}
+
+	if (!VALID_BACKEND(REAL_PRIMARY_NODE_ID))
+	{
+		/* No need to check replication delay if primary is down */
+		return;
+	}
+
+	/* Capture primary node ID to avoid race conditions during execution */
+	primary_node_id = REAL_PRIMARY_NODE_ID;
+
+	if (!pool_config->replication_delay_source_cmd ||
+		strlen(pool_config->replication_delay_source_cmd) == 0)
+	{
+		ereport(WARNING,
+				(errmsg("replication_delay_source_cmd is not configured"),
+				 errhint("Set replication_delay_source_cmd to use external command mode")));
+		/* Fall back to builtin method */
+		check_replication_time_lag();
+		return;
+	}
+
+	/* Allocate buffer for command output */
+	line = palloc(MAX_CMD_OUTPUT);
+	memset(line, 0, MAX_CMD_OUTPUT);
+
+	/*
+	 * Register a error context callback to throw proper context message
+	 */
+	callback.callback = CheckReplicationTimeLagErrorCb;
+	callback.arg = NULL;
+	callback.previous = error_context_stack;
+	error_context_stack = &callback;
+
+	/* Execute command as current process user */
+	PG_TRY();
+	{
+		base_command = pool_config->replication_delay_source_cmd;
+		total_len = strlen(base_command) + 1;	/* +1 for NUL */
+
+		/* Build command with replica-only arguments (omit primary) */
+
+		/*
+		 * Calculate total command length including space-separated replica
+		 * identifiers
+		 */
+		for (i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == primary_node_id)
+				continue;		/* Skip primary node */
+
+			ident = build_instance_identifier_for_node(i);
+
+			total_len += 1 /* space */ + strlen(ident);
+			pfree(ident);
+		}
+
+		command = palloc(total_len);
+		strlcpy(command, base_command, total_len);
+
+		/* Append replica identifiers */
+		current_len = strlen(command);
+
+		for (i = 0; i < NUM_BACKENDS; i++)
+		{
+			if (i == primary_node_id)
+				continue;		/* Skip primary node */
+
+			ident = build_instance_identifier_for_node(i);
+
+			/* Append space and identifier */
+			snprintf(command + current_len, total_len - current_len, " %s", ident);
+			current_len += strlen(command + current_len);
+
+			pfree(ident);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("executing replication delay command: %s", command)));
+
+		if (pipe(pipefd) == -1)
+		{
+			ereport(ERROR,
+					(errmsg("pipe failed: %m")));
+		}
+
+		pid = fork();
+		if (pid == -1)
+		{
+			close(pipefd[0]);
+			close(pipefd[1]);
+			ereport(ERROR,
+					(errmsg("fork failed: %m")));
+		}
+
+		if (pid == 0)
+		{
+			/* Child process */
+			close(pipefd[0]);	/* Close read end */
+			if (dup2(pipefd[1], STDOUT_FILENO) == -1)
+			{
+				fprintf(stderr, "dup2 failed: %s\n", strerror(errno));
+				exit(1);
+			}
+			close(pipefd[1]);	/* Close write end (duplicated to stdout) */
+
+			/* Execute command using shell */
+			execl("/bin/sh", "sh", "-c", command, (char *) NULL);
+
+			/* If execl fails */
+			fprintf(stderr, "execl failed: %s\n", strerror(errno));
+			_exit(127);
+		}
+
+		/* Parent process */
+		close(pipefd[1]);		/* Close write end */
+		pipefd[1] = -1;
+
+		/* Set up timeout for select */
+		timeout.tv_sec = pool_config->replication_delay_source_timeout;
+		timeout.tv_usec = 0;
+
+		FD_ZERO(&readfds);
+		FD_SET(pipefd[0], &readfds);
+
+		/* Wait for output or timeout */
+		ret = select(pipefd[0] + 1, &readfds, NULL, NULL, &timeout);
+
+		if (ret == -1)
+		{
+			save_errno = errno;
+
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+			pid = -1;
+			close(pipefd[0]);
+			pipefd[0] = -1;
+			if (save_errno == EINTR)
+			{
+				/* Interrupted */
+				ereport(ERROR,
+						(errmsg("select interrupted during replication delay command execution")));
+			}
+			else
+			{
+				ereport(ERROR,
+						(errmsg("select failed: %m")));
+			}
+		}
+		else if (ret == 0)
+		{
+			/* Timeout */
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+			pid = -1;
+			close(pipefd[0]);
+			pipefd[0] = -1;
+			ereport(ERROR,
+					(errmsg("replication delay command timed out after %d seconds: %s",
+							pool_config->replication_delay_source_timeout, command),
+					 errhint("Consider increasing replication_delay_source_timeout or optimizing the command")));
+		}
+
+		/* Data is available */
+		bytes_read = read(pipefd[0], line, MAX_CMD_OUTPUT - 1);
+		close(pipefd[0]);
+		pipefd[0] = -1;
+
+		/* Wait for child to finish */
+		waitpid(pid, &status, 0);
+		pid = -1;
+
+		if (bytes_read < 0)
+		{
+			ereport(ERROR,
+					(errmsg("failed to read output from replication delay command: %s", command),
+					 errdetail("read failed: %m")));
+		}
+
+		/* Check exit status */
+		if (WIFEXITED(status) && WEXITSTATUS(status) != 0)
+		{
+			ereport(ERROR,
+					(errmsg("replication delay command failed with exit code %d: %s",
+							WEXITSTATUS(status), command)));
+		}
+		else if (WIFSIGNALED(status))
+		{
+			ereport(ERROR,
+					(errmsg("replication delay command terminated by signal %d: %s",
+							WTERMSIG(status), command)));
+		}
+
+		/* Check if output was truncated */
+		if (bytes_read == MAX_CMD_OUTPUT - 1 && line[MAX_CMD_OUTPUT - 2] != '\n')
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command output may have been truncated")));
+		}
+
+		/* Null-terminate the string */
+		line[bytes_read] = '\0';
+
+		pfree(command);
+		command = NULL;
+
+		/* Set primary node delay to 0 */
+		bkinfo = pool_get_node_info(primary_node_id);
+		bkinfo->standby_delay = 0;
+		bkinfo->standby_delay_by_time = true;
+
+		/* Count expected replicas */
+		num_replicas = NUM_BACKENDS - 1;	/* Total nodes minus primary */
+
+		/* Count tokens in output for validation */
+		line_copy = pstrdup(line);
+		temp_token = strtok(line_copy, " \t\n");
+
+		while (temp_token != NULL)
+		{
+			token_count++;
+			temp_token = strtok(NULL, " \t\n");
+		}
+		pfree(line_copy);
+
+		/* Validate output format */
+		if (token_count == 0)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command produced no output"),
+					 errhint("Command should output delay values separated by spaces, one per replica node")));
+		}
+		else if (token_count < num_replicas)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
+							token_count, num_replicas),
+					 errhint("Command should output one delay value per replica node. Missing values will be treated as 0.")));
+		}
+		else if (token_count > num_replicas)
+		{
+			ereport(WARNING,
+					(errmsg("replication delay command returned %d values, expected %d (one per replica, excluding primary)",
+							token_count, num_replicas),
+					 errhint("Command should output exactly one delay value per replica node. Extra values will be ignored.")));
+		}
+
+		/* Parse the output - one delay value per replica in order */
+		token = strtok_r(line, " \t\n", &saveptr);
+
+		for (i = 0; i < NUM_BACKENDS && token != NULL; i++)
+		{
+			if (i == primary_node_id)
+				continue;		/* Skip primary - it's not in the output */
+
+			if (!VALID_BACKEND(i))
+			{
+				/* Skip invalid backend but consume token */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				continue;
+			}
+
+			delay_ms = strtod(token, &endptr);
+
+			/* Validate the conversion */
+			if (*endptr != '\0')
+			{
+				ereport(WARNING,
+						(errmsg("invalid delay value '%s' for node %d, treating as 0",
+								token, i)));
+				delay_ms = 0;
+			}
+
+			bkinfo = pool_get_node_info(i);
+
+			/* Handle -1 for down nodes */
+			if (delay_ms == -1.0)
+			{
+				ereport(LOG,
+						(errmsg("node %d reported as down by external command (delay -1), relying on health check for failover decision",
+								i)));
+				/* Keep previous delay value, don't trigger failover */
+				token = strtok_r(NULL, " \t\n", &saveptr);
+				continue;
+			}
+
+			/* Validate delay value range */
+			if (delay_ms < 0)
+			{
+				ereport(WARNING,
+						(errmsg("negative delay value %.3f for node %d (other than -1), treating as 0",
+								delay_ms, i)));
+				delay_ms = 0;
+			}
+			else if (delay_ms > MAX_REASONABLE_DELAY_MS)
+			{
+				ereport(WARNING,
+						(errmsg("extremely large delay value %.3f for node %d",
+								delay_ms, i)));
+			}
+
+			/*
+			 * Convert delay from milliseconds to microseconds for internal
+			 * storage
+			 */
+			delay = (uint64) (delay_ms * 1000);
+			bkinfo->standby_delay = delay;
+			bkinfo->standby_delay_by_time = true;
+
+			/* Log delay if necessary */
+			delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000;	/* threshold is in
+																					 * milliseconds, convert
+																					 * to microseconds */
+
+			if ((pool_config->log_standby_delay == LSD_ALWAYS && delay_ms > 0) ||
+				(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
+				 bkinfo->standby_delay > delay_threshold_by_time))
+			{
+				ereport(LOG,
+						(errmsg("Replication of node: %d is behind %.3f second(s) from the primary server (node: %d) [external command]",
+								i, delay_ms / 1000, primary_node_id)));
+			}
+
+			token = strtok_r(NULL, " \t\n", &saveptr);
+		}
+
+	}
+	PG_CATCH();
+	{
+		/* Cleanup in case of error */
+		if (pid > 0)
+		{
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+		}
+		if (pipefd[0] != -1)
+			close(pipefd[0]);
+		if (pipefd[1] != -1)
+			close(pipefd[1]);
+
+		if (line)
+			pfree(line);
+		if (command)
+			pfree(command);
+		error_context_stack = callback.previous;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Normal cleanup */
+	if (line)
+		pfree(line);
+
+	error_context_stack = callback.previous;
+}
+
+/*
+ * build_instance_identifier_for_node
+ *  Build an identifier string for a backend node for passing to external commands.
+ *  Format: "<hostname>:<port>"
+ */
+static char *
+build_instance_identifier_for_node(int node_id)
+{
+	BackendInfo *bi = pool_get_node_info(node_id);
+	const char *hostname;
+
+	if (!bi || bi->backend_hostname[0] == '\0' || bi->backend_port <= 0)
+	{
+		/* Fallback if hostname or port is not set */
+		return psprintf("unknown_node_%d", node_id);
+	}
+
+	hostname = bi->backend_hostname;
+
+	/* Validate hostname for security - check for shell metacharacters */
+	if (strpbrk(hostname, "$`\\|;&<>()[]{}\"\'\n\r\t") != NULL)
+	{
+		ereport(LOG,
+				(errmsg("hostname for node %d contains potentially dangerous characters: %s",
+						node_id, hostname),
+				 errhint("Hostnames with shell metacharacters may pose security risks when used with external commands. Consider using IP addresses or sanitized hostnames.")));
+	}
+
+	/* Use hostname:port format */
+	return psprintf("%s:%d", hostname, bi->backend_port);
+}
+
 static void
 CheckReplicationTimeLagErrorCb(void *arg)
 {
@@ -715,6 +1167,9 @@ static RETSIGTYPE my_signal_handler(int sig)
 			restart_request = 1;
 			break;
 
+		case SIGCHLD:
+			break;
+
 		default:
 			exit(1);
 			break;
diff --git a/src/test/regression/tests/041.external_replication_delay/README b/src/test/regression/tests/041.external_replication_delay/README
new file mode 100644
index 0000000000000000000000000000000000000000..b4df5da402b557190c8f6a2bc7822944cc5b04cc
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/README
@@ -0,0 +1,59 @@
+External Replication Delay Command Test
+========================================
+
+This test verifies the external command replication delay source feature.
+
+Test Coverage:
+- External command receives replica node identifiers only (primary omitted)
+- Instance identifiers in host:port format
+- Basic external command execution with integer and float millisecond values
+- Delay threshold functionality with external commands
+- Command execution as pgpool process user (no su wrapper)
+- Error handling for missing/invalid commands
+- Command execution failure scenarios
+- Command timeout handling with configurable timeout values
+- Input validation for invalid, negative (other than -1), and extremely large delay values
+- Handling of -1 for down nodes (logged but no immediate failover)
+- Wrong number of output values validation
+- Multiple -1 values (multiple down replicas)
+- Mixed scenarios (some replicas up, some down)
+- Output truncation detection
+
+Files:
+- test.sh: Main test script
+- test_parsing.sh: Unit test for parsing logic
+- test_validation.sh: Validation and edge case testing
+- README: This documentation
+
+Key Changes from Original Version:
+- Primary node is omitted from command arguments
+- Command receives only replica identifiers
+- Instance identifiers are in host:port format (not application_name)
+- Output format: one delay per replica (not per all nodes)
+- -1 value indicates down replica without triggering failover
+- Format example: "25 100" for 2 replicas (3-node cluster = 1 primary + 2 replicas)
+
+The test creates temporary command scripts that output delay values in the format:
+"replica1_delay replica2_delay ..."
+
+Where delays are in milliseconds and can be integer or floating-point values.
+Special value -1 indicates a replica that is down but not yet detected by pgpool.
+
+Test Environment:
+- Uses streaming replication mode with 3 nodes
+- Node 0 is primary (omitted from command arguments)
+- Nodes 1 and 2 are replicas (included in command arguments)
+- Configures sr_check_period = 1 second for faster testing
+- Tests various delay scenarios and threshold behaviors
+
+Expected Behavior:
+- External commands receive replica identifiers in host:port format
+- Primary node identifier is never passed to command
+- Command outputs one delay value per replica
+- -1 values are logged but don't trigger immediate failover
+- Delay values are parsed correctly (both int and float)
+- Threshold comparisons work properly
+- Error conditions are handled gracefully
+- Commands timeout appropriately based on configuration
+- Timeout errors provide helpful messages and hints
+- Tests are reliable with proper wait mechanisms instead of fixed sleeps
diff --git a/src/test/regression/tests/041.external_replication_delay/test.sh b/src/test/regression/tests/041.external_replication_delay/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..de704e55331247893f4b2e26fb67977875f1ba42
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test.sh
@@ -0,0 +1,409 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command replication delay source
+#
+source $TESTLIBS
+TESTDIR=testdir
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create external command scripts for testing
+# NOTE: Commands now only output delay values for REPLICAS (not primary)
+cat > delay_cmd_static.sh << 'EOF'
+#!/bin/bash
+# Static delay values for replicas: node1=25ms, node2=50ms (node0 is primary, not included)
+echo "25 50"
+EOF
+chmod +x delay_cmd_static.sh
+
+cat > delay_cmd_float.sh << 'EOF'
+#!/bin/bash
+# Float delay values for replicas: node1=25.5ms, node2=100.75ms
+echo "25.5 100.75"
+EOF
+chmod +x delay_cmd_float.sh
+
+cat > delay_cmd_high.sh << 'EOF'
+#!/bin/bash
+# High delay values to test threshold: node1=2000ms, node2=3000ms
+echo "2000 3000"
+EOF
+chmod +x delay_cmd_high.sh
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test0: External command receives replica identifiers only (primary omitted) ==="
+# ----------------------------------------------------------------------------------------
+# Command that captures its arguments and outputs valid delays for 2 replicas
+cat > delay_cmd_args.sh << 'EOF'
+#!/bin/bash
+printf "%s " "$@" > args.txt
+echo "25 50"
+EOF
+chmod +x delay_cmd_args.sh
+
+echo "replication_delay_source_cmd = './delay_cmd_args.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+# Reduce memory requirements for macOS shared memory limits
+echo "num_init_children = 4" >> etc/pgpool.conf
+echo "max_pool = 2" >> etc/pgpool.conf
+# Disable query caching to avoid shared memory issues on macOS
+echo "memory_cache_enabled = off" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+echo "Waiting for sr_check to pass args..."
+for i in {1..10}; do
+    if [ -f args.txt ]; then
+        break
+    fi
+    sleep 1
+done
+
+if [ ! -f args.txt ]; then
+    echo fail: did not capture command arguments
+    ./shutdownall
+    exit 1
+fi
+
+ARGS_CONTENT=$(cat args.txt | sed 's/[[:space:]]*$//')
+# Should receive 2 replica identifiers in host:port format (localhost:11003 localhost:11004 or server1:11003 server2:11004)
+# Primary (localhost:11002 or server0:11002) should be omitted
+if ! echo "$ARGS_CONTENT" | grep -qE "(server1|localhost):11003"; then
+    echo "fail: expected replica1:11003 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if ! echo "$ARGS_CONTENT" | grep -qE "(server2|localhost):11004"; then
+    echo "fail: expected replica2:11004 in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+if echo "$ARGS_CONTENT" | grep -qE "(server0|localhost):11002"; then
+    echo "fail: primary should not be in arguments, got: '$ARGS_CONTENT'"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: argument order correct - replicas only, primary omitted, host:port format
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test1: Basic external command with integer millisecond values ==="
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_static.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run and populate delay values
+# sr_check_period is 1 second, so wait a bit longer to ensure it runs
+echo "Waiting for sr_check to run..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command" log/pgpool.log 2>/dev/null; then
+        echo "Command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that delay values are populated in the log
+grep "executing replication delay command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command was not executed
+    echo "Log contents:"
+    tail -20 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+# Verify actual delay values were parsed
+if ! $PSQL -t -c "SHOW POOL_NODES" test | grep -E "[0-9]+\.[0-9]+" >/dev/null; then
+    echo "Warning: No delay values found in POOL_NODES output"
+fi
+
+# Check for delay log messages
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: external command delay logging not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: basic external command test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test2: External command with floating-point millisecond values ==="
+# ----------------------------------------------------------------------------------------
+# Update configuration to use float command
+sed -i.bak "s|delay_cmd_static.sh|delay_cmd_float.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with float values
+echo "Waiting for sr_check with float values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log 2>/dev/null; then
+        echo "Float command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SHOW POOL_NODES;
+EOF
+
+# Check that float values are handled correctly
+grep "executing replication delay command.*delay_cmd_float.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: float command was not executed
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: floating-point values test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test3: External command with delay threshold ==="
+# ----------------------------------------------------------------------------------------
+# Update configuration to use high delay command and set threshold
+sed -i.bak "s|delay_cmd_float.sh|delay_cmd_high.sh|" etc/pgpool.conf
+echo "delay_threshold_by_time = 1000" >> etc/pgpool.conf
+echo "backend_weight0 = 0" >> etc/pgpool.conf  # Force queries to standby normally
+echo "backend_weight2 = 0" >> etc/pgpool.conf  # Only use node 1 as standby
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and detect high delays
+echo "Waiting for sr_check with high delay values..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_high.sh" log/pgpool.log 2>/dev/null; then
+        echo "High delay command executed after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+$PSQL test <<EOF
+SELECT * FROM t1 LIMIT 1;
+EOF
+
+# With high delays (2000ms > 1000ms threshold), query should go to primary (node 0)
+# Log format can vary: either "statement: SELECT..." or "SELECT... DB node id:"
+if ! grep -E "DB node id: 0.*statement: SELECT \* FROM t1 LIMIT 1" log/pgpool.log >/dev/null 2>&1 && \
+   ! grep -E "SELECT \* FROM t1 LIMIT 1.*DB node id: 0" log/pgpool.log >/dev/null 2>&1; then
+    echo fail: query was not sent to primary node despite high delay
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: delay threshold test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test4: External command execution as process user ==="
+# ----------------------------------------------------------------------------------------
+# Test that command runs as the current pgpool process user
+sed -i.bak "s|delay_cmd_high.sh|delay_cmd_static.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for sr_check to run as process user..."
+for i in {1..10}; do
+    if grep -q "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed as process user after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check that command was executed (without su wrapper)
+grep "executing replication delay command.*delay_cmd_static.sh" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command was not executed as process user
+    ./shutdownall
+    exit 1
+fi
+
+# Verify no su command was used
+if grep -q "executing replication delay command.*su.*" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not use su wrapper
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: process user execution test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test5: Error handling - missing command ==="
+# ----------------------------------------------------------------------------------------
+# Test error handling when command is not configured
+sed -i.bak "s|replication_delay_source_cmd = './delay_cmd_static.sh'|replication_delay_source_cmd = ''|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# With empty command, should fall back to builtin method
+# No specific error message expected - just verify it doesn't crash
+sleep 3
+
+echo "ok: empty command test succeeded (fallback to builtin)"
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test6: Error handling - command execution failure ==="
+# ----------------------------------------------------------------------------------------
+# Test error handling when command fails
+echo "replication_delay_source_cmd = './nonexistent_command.sh'" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run with failing command
+echo "Waiting for sr_check with failing command..."
+for i in {1..5}; do
+    # Check for various error conditions: exit code failure, no output, or explicit failure message
+    if grep -qE "(replication delay command failed with exit code|replication delay command produced no output|failed to (execute|read output from) replication delay command)" log/pgpool.log 2>/dev/null; then
+        echo "Command failure detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for error message about command execution failure
+# Accept multiple possible error messages depending on shell behavior:
+# - "failed with exit code" when command returns non-zero
+# - "produced no output" when command produces empty output
+# - "failed to execute/read" for other failures
+if ! grep -qE "(replication delay command failed with exit code|replication delay command produced no output|failed to (execute|read output from) replication delay command)" log/pgpool.log 2>/dev/null; then
+    echo fail: command execution failure not detected
+    echo "Log contents:"
+    tail -50 log/pgpool.log
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command failure test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test7: Command timeout handling ==="
+# ----------------------------------------------------------------------------------------
+# Create a command that takes longer than the timeout
+cat > delay_cmd_slow.sh << 'EOF'
+#!/bin/bash
+# Slow command that takes 15 seconds (longer than default 10s timeout)
+sleep 15
+echo "25 50"
+EOF
+chmod +x delay_cmd_slow.sh
+
+# Set a short timeout and use the slow command
+sed -i.bak "s|replication_delay_source_cmd = './nonexistent_command.sh'|replication_delay_source_cmd = './delay_cmd_slow.sh'|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 3" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run and timeout
+echo "Waiting for command timeout..."
+for i in {1..15}; do
+    if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+        echo "Command timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout error message
+grep "replication delay command timed out after 3 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: command timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: command timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test8: Handling of -1 for down nodes ==="
+# ----------------------------------------------------------------------------------------
+# Create a command that returns -1 for one replica
+cat > delay_cmd_with_down_node.sh << 'EOF'
+#!/bin/bash
+# Return -1 for first replica (indicating it's down), normal value for second
+echo "-1 50"
+EOF
+chmod +x delay_cmd_with_down_node.sh
+
+# Reset config
+rm -f etc/pgpool.conf.bak
+sed -i.bak "s|delay_cmd_slow.sh|delay_cmd_with_down_node.sh|" etc/pgpool.conf
+sed -i.bak "s|replication_delay_source_timeout = 3|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to process -1 value
+echo "Waiting for sr_check to process -1 value..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command.*delay -1" log/pgpool.log 2>/dev/null; then
+        echo "-1 handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for -1 logging message
+grep "node.*reported as down by external command.*delay -1.*relying on health check" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 handling message not found
+    ./shutdownall
+    exit 1
+fi
+
+# Verify that pgpool didn't trigger failover just from -1
+# Check for actual failover execution, not just config mentions of failover_command
+if grep -qE "(starting.*(failover|degeneration)|failover done|execute.*(failover|failback)_command)" log/pgpool.log 2>/dev/null; then
+    echo "fail: -1 should not trigger immediate failover"
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: -1 handling test succeeded
+./shutdownall
+
+echo "All external replication delay tests passed!"
+exit 0
diff --git a/src/test/regression/tests/041.external_replication_delay/test_parsing.sh b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
new file mode 100755
index 0000000000000000000000000000000000000000..82fdad144cf5a94efbf79020a50ebc2ef00d6fb8
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_parsing.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+#-------------------------------------------------------------------
+# Unit test for external command parsing logic
+# This tests the parsing without needing a full pgpool setup
+#
+
+echo "=== Testing external command output parsing ==="
+
+# Test 1: Integer values
+echo "Test 1: Integer millisecond values"
+echo "0 25 50" > test_output.txt
+echo "Expected: 0ms, 25ms, 50ms"
+echo "Output: $(cat test_output.txt)"
+echo ""
+
+# Test 2: Float values
+echo "Test 2: Floating-point millisecond values"
+echo "0 25.5 100.75" > test_output_float.txt
+echo "Expected: 0ms, 25.5ms, 100.75ms"
+echo "Output: $(cat test_output_float.txt)"
+echo ""
+
+# Test 3: High precision float values
+echo "Test 3: High precision values"
+echo "0 0.001 999.999" > test_output_precision.txt
+echo "Expected: 0ms, 0.001ms, 999.999ms"
+echo "Output: $(cat test_output_precision.txt)"
+echo ""
+
+# Test 4: Edge case - zero values
+echo "Test 4: All zero values"
+echo "0 0 0" > test_output_zeros.txt
+echo "Expected: 0ms, 0ms, 0ms"
+echo "Output: $(cat test_output_zeros.txt)"
+echo ""
+
+# Test 5: Edge case - large values
+echo "Test 5: Large delay values"
+echo "0 5000 10000" > test_output_large.txt
+echo "Expected: 0ms, 5000ms, 10000ms"
+echo "Output: $(cat test_output_large.txt)"
+echo ""
+
+# Test 6: Mixed integer and float values
+echo "Test 6: Mixed integer and float values"
+echo "0 25 50.5" > test_output_mixed.txt
+echo "Expected: 0ms, 25ms, 50.5ms"
+echo "Output: $(cat test_output_mixed.txt)"
+echo ""
+
+# Cleanup
+rm -f test_output_*.txt
+
+echo "All parsing tests completed. These outputs should be parseable by the external command feature."
diff --git a/src/test/regression/tests/041.external_replication_delay/test_validation.sh b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
new file mode 100755
index 0000000000000000000000000000000000000000..2cd4a7f0b35e152b6d4b770931ed4821cdd9d201
--- /dev/null
+++ b/src/test/regression/tests/041.external_replication_delay/test_validation.sh
@@ -0,0 +1,323 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for external command validation and edge cases
+#
+source $TESTLIBS
+TESTDIR=testdir_validation
+PG_CTL=$PGBIN/pg_ctl
+PSQL="$PGBIN/psql -X "
+
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# create test environment
+echo -n "creating test environment..."
+$PGPOOL_SETUP -m s -n 3 || exit 1
+echo "done."
+source ./bashrc.ports
+export PGPORT=$PGPOOL_PORT
+
+# Create test command scripts
+# NOTE: All commands output values for REPLICAS only (primary omitted)
+cat > delay_cmd_validation.sh << 'EOF'
+#!/bin/bash
+# Test validation: output with invalid values for 2 replicas
+echo "invalid_value 50.5"
+EOF
+chmod +x delay_cmd_validation.sh
+
+cat > delay_cmd_negative.sh << 'EOF'
+#!/bin/bash
+# Test negative values (other than -1)
+echo "-25 50"
+EOF
+chmod +x delay_cmd_negative.sh
+
+cat > delay_cmd_large.sh << 'EOF'
+#!/bin/bash
+# Test extremely large values
+echo "9999999 50"
+EOF
+chmod +x delay_cmd_large.sh
+
+cat > delay_cmd_wrong_count.sh << 'EOF'
+#!/bin/bash
+# Test wrong number of values (only 1 instead of 2 for 2 replicas)
+echo "25"
+EOF
+chmod +x delay_cmd_wrong_count.sh
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test1: Validation of invalid delay values ==="
+# ----------------------------------------------------------------------------------------
+echo "replication_delay_source_cmd = './delay_cmd_validation.sh'" >> etc/pgpool.conf
+echo "sr_check_period = 1" >> etc/pgpool.conf
+echo "log_standby_delay = 'always'" >> etc/pgpool.conf
+echo "log_min_messages = 'DEBUG1'" >> etc/pgpool.conf
+# Reduce memory requirements for macOS shared memory limits
+echo "num_init_children = 4" >> etc/pgpool.conf
+echo "max_pool = 2" >> etc/pgpool.conf
+# Disable query caching to avoid shared memory issues on macOS
+echo "memory_cache_enabled = off" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+$PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+EOF
+
+# Wait for sr_check to run
+echo "Waiting for validation test..."
+for i in {1..10}; do
+    if grep -q "invalid delay value" log/pgpool.log 2>/dev/null; then
+        echo "Validation error detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for validation warning
+grep "invalid delay value 'invalid_value' for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: validation warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: invalid value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test2: Negative delay values (other than -1) ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_validation.sh|delay_cmd_negative.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for negative value test..."
+for i in {1..10}; do
+    if grep -q "negative delay value.*other than -1" log/pgpool.log 2>/dev/null; then
+        echo "Negative value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for negative value warning
+grep "negative delay value.*other than -1.*treating as 0" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: negative value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: negative value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test3: Extremely large delay values ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_negative.sh|delay_cmd_large.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for large value test..."
+for i in {1..10}; do
+    if grep -q "extremely large delay value" log/pgpool.log 2>/dev/null; then
+        echo "Large value warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for large value warning
+grep "extremely large delay value.*for node" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: large value warning not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: large value validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test4: Wrong number of output values ==="
+# ----------------------------------------------------------------------------------------
+sed -i.bak "s|delay_cmd_large.sh|delay_cmd_wrong_count.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for wrong count test..."
+for i in {1..10}; do
+    if grep -q "returned.*values, expected.*replica" log/pgpool.log 2>/dev/null; then
+        echo "Wrong count warning detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for wrong count warning
+grep "returned.*values, expected.*replica.*Command should output one delay value per replica" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: wrong count validation test not found
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: wrong count validation test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test5: Multiple -1 values ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_multi_down.sh << 'EOF'
+#!/bin/bash
+# Test multiple replicas down
+echo "-1 -1"
+EOF
+chmod +x delay_cmd_multi_down.sh
+
+sed -i.bak "s|delay_cmd_wrong_count.sh|delay_cmd_multi_down.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check to run
+echo "Waiting for multi-down test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Multiple down nodes detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for multiple -1 handling
+DOWN_COUNT=$(grep -c "node.*reported as down by external command.*delay -1" log/pgpool.log)
+if [ "$DOWN_COUNT" -lt 2 ]; then
+    echo fail: expected 2 down node messages, found $DOWN_COUNT
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: multiple -1 handling test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test6: Command timeout with different timeout values ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_timeout.sh << 'EOF'
+#!/bin/bash
+# Command that takes 5 seconds
+sleep 5
+echo "25 50"
+EOF
+chmod +x delay_cmd_timeout.sh
+
+# Test with timeout shorter than command duration
+sed -i.bak "s|delay_cmd_multi_down.sh|delay_cmd_timeout.sh|" etc/pgpool.conf
+echo "replication_delay_source_timeout = 2" >> etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for timeout
+echo "Waiting for timeout test (2s timeout, 5s command)..."
+for i in {1..10}; do
+    if grep -q "replication delay command timed out after 2 seconds" log/pgpool.log 2>/dev/null; then
+        echo "Timeout detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Check for timeout message
+grep "replication delay command timed out after 2 seconds" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: timeout not detected
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: timeout test succeeded
+./shutdownall
+
+# Test with timeout longer than command duration
+sed -i.bak "s|replication_delay_source_timeout = 2|replication_delay_source_timeout = 10|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for successful execution
+echo "Waiting for successful execution (10s timeout, 5s command)..."
+for i in {1..15}; do
+    if grep -q "executing replication delay command.*delay_cmd_timeout.sh" log/pgpool.log 2>/dev/null; then
+        echo "Command executed successfully after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should not timeout this time
+if grep -q "replication delay command timed out" log/pgpool.log 2>/dev/null; then
+    echo fail: command should not have timed out with 10s timeout
+    ./shutdownall
+    exit 1
+fi
+
+echo ok: extended timeout test succeeded
+./shutdownall
+
+# ----------------------------------------------------------------------------------------
+echo "=== Test7: Mix of valid delays and -1 ==="
+# ----------------------------------------------------------------------------------------
+cat > delay_cmd_mixed.sh << 'EOF'
+#!/bin/bash
+# One replica up (25ms), one down (-1)
+echo "25 -1"
+EOF
+chmod +x delay_cmd_mixed.sh
+
+sed -i.bak "s|delay_cmd_timeout.sh|delay_cmd_mixed.sh|" etc/pgpool.conf
+
+./startall
+wait_for_pgpool_startup
+
+# Wait for sr_check
+echo "Waiting for mixed delay test..."
+for i in {1..10}; do
+    if grep -q "node.*reported as down by external command" log/pgpool.log 2>/dev/null; then
+        echo "Mixed delay handling detected after ${i} seconds"
+        break
+    fi
+    sleep 1
+done
+
+# Should log one -1 and process one normal delay
+grep "node.*reported as down by external command.*delay -1" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo fail: -1 not logged
+    ./shutdownall
+    exit 1
+fi
+
+# Should also log the normal replica delay
+grep "Replication of node.*external command" log/pgpool.log >/dev/null 2>&1
+if [ $? != 0 ];then
+    echo "Note: Normal replica delay logging may not be visible with log_standby_delay settings"
+fi
+
+echo ok: mixed delay handling test succeeded
+./shutdownall
+
+echo "All validation tests passed!"
+exit 0
\ No newline at end of file
-- 
2.52.0



^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2026-01-05 23:52  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2026-01-05 23:52 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> Thanks for the help! please find attached the latest version with all
> changes and test passing.

Thanks for updating patch! I confirmed that all test have passed on my
Ubunu box.  Now I am working on Japanese document. While working on
it, I did followings to the English document. (see attached)

- Reformat it so that each line is not too long. Like PostgreSQL, I
  bend each line at most 78 chars. (I know other parts of document do
  not follow the rule but I do not want to add more lines not
  following the rule).

- Move replication_delay_source_cmd (string) and
  replication_delay_source_timeout (integer) at the bottom of
  "5.12. Streaming Replication Check" section. We usually add the new
  parameters at the bottom of the page if there's no particular
  reason. Previously they were in between prefer_lower_delay_standby
  and log_standby_delay.

- Add following to replication_delay_source_cmd.  "The line can be
  terminated with or without a new line character." This is observed
  from the implementation. I believe this matters for those who try to
  implement replication_delay_source_cmd.

Lastly I have one question.

replication_delay_source_timeout (integer)

    Specifies the timeout in seconds for the external command
    specified by replication_delay_source_cmd. If the command does not
    finish within the timeout, Pgpool-II logs an error and continues
    using the built-in method.

It seems this ("continues using the built-in method") is different
from the actual behavior. It seems that after timeout, the external
command is tried and timeout.... Do you want to fix the source code to
match with the document? Or change (just remove ""continues using the
built-in method") the document? I am fine with changing the document.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


Attachments:

  [text/sgml] stream-check.sgml (15.9K, 2-stream-check.sgml)
  download | inline:
<!-- doc/src/sgml/config.sgml -->

<sect1 id="runtime-streaming-replication-check">
 <title>Streaming Replication Check</title>

 <para>
  <productname>Pgpool-II</productname> can work with <productname>PostgreSQL</> native
  Streaming Replication, that is available since <productname>PostgreSQL</> 9.0.
  To configure <productname>Pgpool-II</productname> with streaming
  replication, set
   <xref linkend="guc-backend-clustering-mode"> to <literal>'streaming-replication'</literal>.
 </para>
 <para>
  <productname>Pgpool-II</productname> assumes that Streaming Replication
  is configured with Hot Standby on PostgreSQL, which means that the
  standby database can handle read-only queries.
 </para>

 <variablelist>

  <varlistentry id="guc-sr-check-period" xreflabel="sr_check_period">
   <term><varname>sr_check_period</varname> (<type>integer</type>)
    <indexterm>
     <primary><varname>sr_check_period</varname> configuration parameter</primary>
    </indexterm>
   </term>
   <listitem>

    <para>
     Specifies the time interval in seconds to check the streaming
     replication delay.
     The default is 10.
    </para>

    <para>
     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
    </para>

   </listitem>
  </varlistentry>

  <varlistentry id="guc-sr-check-user" xreflabel="sr_check_user">
   <term><varname>sr_check_user</varname> (<type>string</type>)
    <indexterm>
     <primary><varname>sr_check_user</varname> configuration parameter</primary>
    </indexterm>
   </term>
   <listitem>

    <para>
     Specifies the <productname>PostgreSQL</> user name to perform streaming replication check.
     Default is <literal>''</literal>(empty).
     The user must have LOGIN privilege and exist on all the
     <productname>PostgreSQL</> backends.
     Moreover the user must be a <productname>PostgreSQL</productname>
     super user or in "pg_monitor" group.
   <note>
    <para>
     To make <xref linkend="guc-sr-check-user"> in pg_monitor
     group, execute following SQL command
     by <productname>PostgreSQL</productname> super user (replace
     "sr_check_user" with the setting of <xref linkend="guc-sr-check-user">):
     <programlisting>
GRANT pg_monitor TO sr_check_user;
     </programlisting>
     For <productname>PostgreSQL</productname> 9.6, there's no
     pg_monitor group and <xref linkend="guc-sr-check-user"> must
     be <productname>PostgreSQL</productname> super user.
    </para>
   </note>

    </para>
    <para>
     If <link linkend="runtime-ssl">SSL</link> is enabled, the
     streaming replication check process may use SSL connection.
    </para> 
    <note>
     <para>
      <xref linkend="guc-sr-check-user"> and <xref
      linkend="guc-sr-check-password"> are used even when <xref
      linkend="guc-sr-check-period"> is set to 0 (disabled) for the
      identification of the primary server.
     </para>
    </note>

    <para>
     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
    </para>

   </listitem>
  </varlistentry>

  <varlistentry id="guc-sr-check-password" xreflabel="sr_check_password">
   <term><varname>sr_check_password</varname> (<type>string</type>)
    <indexterm>
     <primary><varname>sr_check_password</varname> configuration parameter</primary>
    </indexterm>
   </term>
   <listitem>

    <para>
     Specifies the password of the <xref linkend="guc-sr-check-user"> <productname>PostgreSQL</> user
      to perform the streaming replication checks.
      Use <literal>''</literal> (empty string) if the user does not requires a password.
    </para>
    <para>
     If <varname>sr_check_password</varname> is left blank <productname>Pgpool-II</productname>
     will first try to get the password for <xref linkend="guc-sr-check-user"> from
      <xref linkend="guc-pool-passwd"> file before using the empty password.
    </para>

    <para>
     <productname>Pgpool-II</productname> accepts following forms
     of password in either <varname>sr_check_password</varname>
     or <xref linkend="guc-pool-passwd"> file:
      <variablelist>

       <varlistentry>
	<term>AES256-CBC encrypted password</term>
	<listitem>
	 <para>
	  Most secure and recommended way to store password. The
	  password string must be prefixed
	  with <literal>AES</literal>.
	  You can use <xref linkend="PG-ENC"> utility to create the correctly formatted
	   <literal>AES</literal> encrypted password strings.
	   <productname>Pgpool-II</productname> will require a valid decryption key at the
	   startup to use the encrypted passwords.
	   see <xref linkend="auth-aes-decryption-key"> for more details on providing the
	    decryption key to <productname>Pgpool-II</productname>
	 </para>
	</listitem>
       </varlistentry>

       <varlistentry>
	<term>MD5 hashed password</term>
	<listitem>
	 <para>
	  Not so secure as AES256, but still better than clear
	  text password. The password string must be prefixed
	  with <literal>MD5</literal>. Note that the backend
	  must set up MD5 authentication as well.  You can
	  use <xref linkend="PG-MD5"> utility to create the
	   correctly formatted
	   <literal>MD5</literal> hashed password strings.
	 </para>
	</listitem>
       </varlistentry>

       <varlistentry>
	<term>Plain text password</term>
	<listitem>
	 <para>
	  Not encrypted, clear text password. You should avoid
	  to use this if possible. The password string must be
	  prefixed with <literal>TEXT</literal>. For example if
	  you want to set <literal>mypass</literal> as a
	  password, you should
	  specify <literal>TEXTmypass</literal> in the password
	  field.  In the absence of a valid
	  prefix, <productname>Pgpool-II</productname> will
	  considered the string as a plain text password.
	 </para>
	</listitem>
       </varlistentry>

      </variablelist>
    </para>

    <para>
     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
    </para>

   </listitem>
  </varlistentry>

  <varlistentry id="guc-sr-check-database" xreflabel="sr_check_database">
   <term><varname>sr_check_database</varname> (<type>string</type>)
    <indexterm>
     <primary><varname>sr_check_database</varname> configuration parameter</primary>
    </indexterm>
   </term>
   <listitem>

    <para>
     Specifies the database to perform streaming replication delay checks.
     The default is <literal>"postgres"</literal>.
    </para>
    <para>
     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
    </para>

   </listitem>
  </varlistentry>

  <varlistentry id="guc-delay-threshold" xreflabel="delay_threshold">
   <term><varname>delay_threshold</varname> (<type>integer</type>)
    <indexterm>
     <primary><varname>delay_threshold</varname> configuration parameter</primary>
    </indexterm>
   </term>
   <listitem>

    <para>
     Specifies the maximum tolerance level of replication delay in
     <acronym>WAL</acronym> bytes on the standby server against the
     primary server. If the delay exceeds this configured level,
     <productname>Pgpool-II</productname> stops sending the <acronym>
      SELECT</acronym> queries to the standby server and starts routing
     everything to the primary server even if <xref linkend="guc-load-balance-mode">
      is enabled, until the standby catches-up with the primary.
      Setting this parameter to 0 disables the delay checking.
      This delay threshold check is performed every <xref linkend="guc-sr-check-period">.
       Default is 0.
    </para>

    <para>
     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
    </para>

   </listitem>
  </varlistentry>

  <varlistentry id="guc-delay-threshold-by-time" xreflabel="delay_threshold_by_time">
   <term><varname>delay_threshold_by_time</varname> (<type>integer</type>)
    <indexterm>
     <primary><varname>delay_threshold_by_time</varname> configuration parameter</primary>
    </indexterm>
   </term>
   <listitem>

    <para>
     Specifies the maximum tolerance level of replication delay 
     on the standby server against the primary server.
     If this value is specified without units, it is taken as milliseconds.
     If the specified value is greater than
     0, <xref linkend="guc-delay-threshold"> is ignored.  If the delay
     exceeds this configured level,
     <productname>Pgpool-II</productname> stops sending the <acronym>
      SELECT</acronym> queries to the standby server and starts routing
     everything to the primary server even if <xref linkend="guc-load-balance-mode">
      is enabled, until the standby catches-up with the primary.
      Setting this parameter to 0 disables the delay checking.
      This delay threshold check is performed every <xref linkend="guc-sr-check-period">.
       Default is 0.
    </para>

    <para>
      Replication delay is taken
      from <productname>PostgreSQL</productname>'s system
      view <structname>pg_stat_replication</structname>.<structfield>replay_lag</structfield>. The
      view is available <productname>PostgreSQL</productname> 10 or
      later. If earlier version
      of <productname>PostgreSQL</productname> is
      used, <productname>Pgpool-II</productname> automatically falls
      back to <xref linkend="guc-delay-threshold">
      and <xref linkend="guc-delay-threshold-by-time"> is ignored.
    </para>

    <para>
     This parameter relies
     on <xref linkend="guc-backend-application-name"> being correctly
     set and matching <varname>application_name</varname> in
     your <productname>PostgreSQL</productname> standby's
     primary_conninfo.
    </para>

    <para>
      If this parameter is
      enabled, <xref linkend="sql-show-pool-nodes">
      and <xref linkend="pcp-node-info"> show replication delay in
      seconds, rather than bytes.
    </para>

    <para>
     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
    </para>

   </listitem>
  </varlistentry>

  <varlistentry id="guc-prefer-lower-delay-standby" xreflabel="prefer_lower_delay_standby">
    <term><varname>prefer_lower_delay_standby</varname> (<type>boolean</type>)
     <indexterm>
      <primary><varname>prefer_lower_delay_standby</varname> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      This parameter is valid only
      when <xref linkend="guc-delay-threshold">
      or <xref linkend="guc-delay-threshold-by-time"> is set to
      greater than 0.  When set to on, if the delay of the load
      balancing node is greater
      than <xref linkend="guc-delay-threshold">
      or <xref linkend="guc-delay-threshold-by-time">,
      <productname>Pgpool-II</productname> does not send read queries
      to the primary node but the least delay standby with
      backend_weight to greater than 0. If delay of all standby nodes
      are greater than <xref linkend="guc-delay-threshold">
      or <xref linkend="guc-delay-threshold-by-time"> the primary
      selected as the load balancing node
      first, <productname>Pgpool-II</productname> sends to the
      primary.  Default is off.
     </para>
     <para>
      This parameter can be changed by reloading the <productname>Pgpool-II</productname> configurations.
     </para>
    </listitem>
  </varlistentry>

  <varlistentry id="guc-log-standby-delay" xreflabel="log_standby_delay">
   <term><varname>log_standby_delay</varname> (<type>enum</type>)
    <indexterm>
     <primary><varname>log_standby_delay</varname> configuration parameter</primary>
    </indexterm>
   </term>
   <listitem>

    <para>
     Specifies when to log the replication delay. Below table contains the list
     of all valid values for the parameter.
    </para>

    <table id="log-standby-delay-table">
     <title>Log standby delay options</title>
     <tgroup cols="2">
      <thead>
       <row>
	<entry>Value</entry>
	<entry>Description</entry>
       </row>
      </thead>

      <tbody>
       <row>
	<entry><literal>none</literal></entry>
	<entry>Never log the standby delay</entry>
       </row>

       <row>
	<entry><literal>always</literal></entry>
	<entry>Log the standby delay if it's greater than 0, every time the replication delay is checked</entry>
       </row>

       <row>
	<entry><literal>if_over_threshold</literal></entry>
	<entry>Only log the standby delay, when it exceeds <xref linkend="guc-delay-threshold"> or <xref linkend="guc-delay-threshold-by-time"> value (the default)</entry>
       </row>
      </tbody>
     </tgroup>
    </table>

    <para>
     This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry id="guc-replication-delay-source-cmd" xreflabel="replication_delay_source_cmd">
   <term><varname>replication_delay_source_cmd</varname> (<type>string</type>)
    <indexterm>
     <primary><varname>replication_delay_source_cmd</varname> configuration parameter</primary>
    </indexterm>
   </term>
   <listitem>
    <para>
     Specifies an external command to retrieve replication delay information
     for replica nodes.  When this parameter is set and not
     empty, <productname>Pgpool-II</productname> uses the external command
     instead of built-in database queries to obtain replication delays.  The
     command is executed as the <productname>Pgpool-II</productname> process
     user.
    </para>
    <para>
     The command receives replica node identifiers as positional arguments,
     with the primary node omitted. Each identifier is in the
     format <literal>&lt;hostname&gt;:&lt;port&gt;</literal>, for
     example <literal>server1:5432 server2:5432</literal>. The order matches
     <productname>Pgpool-II</productname>'s backend order (excluding the
     primary), allowing the script to correlate external metrics (such as from
     AWS CloudWatch for Aurora) to the correct nodes.
    </para>
    <para>
     The command must write a single line to stdout containing one
     whitespace-separated delay value per replica, in milliseconds, in the
     same order as the arguments. The line can be terminated with or without a
     new line character. The primary node's delay is implicitly zero and
     should not be included in the output. Delay values can be integers or
     floating-point numbers.
    </para>
    <para>
     Special value: <literal>-1</literal> indicates a replica that is down but
     not yet detected by <productname>Pgpool-II</productname>'s health
     checks. <productname>Pgpool-II</productname> will log this condition but
     rely on its own health-check logic to decide whether to trigger failover;
     no failover is triggered solely by receiving <literal>-1</literal>.
    </para>
    <para>
     Example for a 3-node cluster (1 primary + 2 replicas): if the command
     receives arguments
     <literal>server1:5432 server2:5432</literal>, it should
     output <literal>"25.5 100"</literal> to indicate the first replica has
     25.5ms delay and the second has 100ms delay.
    </para>
    <para>
     Default is empty (use built-in replication delay queries).
    </para>
    <para>
     This parameter can be changed by reloading the <productname>Pgpool-II</>
     configurations.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry id="guc-replication-delay-source-timeout" xreflabel="replication_delay_source_timeout">
   <term><varname>replication_delay_source_timeout</varname> (<type>integer</type>)
    <indexterm>
     <primary><varname>replication_delay_source_timeout</varname> configuration parameter</primary>
    </indexterm>
   </term>
   <listitem>
    <para>
     Specifies the timeout in seconds for the external command specified by
     <xref linkend="guc-replication-delay-source-cmd">.  If the command does
      not finish within the timeout, <productname>Pgpool-II</productname> logs
      an error and continues using the built-in method.
    </para>
    <para>
     Default is 10 seconds. Valid range is 1-3600 seconds.
    </para>
    <para>
     This parameter can be changed by reloading the <productname>Pgpool-II</>
     configurations.
    </para>
   </listitem>
  </varlistentry>

 </variablelist>

</sect1>

^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2026-01-05 23:59  Tatsuo Ishii <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2026-01-05 23:59 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

And... attached is the Japanese document. For those who are interested.
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

>> Thanks for the help! please find attached the latest version with all
>> changes and test passing.
> 
> Thanks for updating patch! I confirmed that all test have passed on my
> Ubunu box.  Now I am working on Japanese document. While working on
> it, I did followings to the English document. (see attached)
> 
> - Reformat it so that each line is not too long. Like PostgreSQL, I
>   bend each line at most 78 chars. (I know other parts of document do
>   not follow the rule but I do not want to add more lines not
>   following the rule).
> 
> - Move replication_delay_source_cmd (string) and
>   replication_delay_source_timeout (integer) at the bottom of
>   "5.12. Streaming Replication Check" section. We usually add the new
>   parameters at the bottom of the page if there's no particular
>   reason. Previously they were in between prefer_lower_delay_standby
>   and log_standby_delay.
> 
> - Add following to replication_delay_source_cmd.  "The line can be
>   terminated with or without a new line character." This is observed
>   from the implementation. I believe this matters for those who try to
>   implement replication_delay_source_cmd.
> 
> Lastly I have one question.
> 
> replication_delay_source_timeout (integer)
> 
>     Specifies the timeout in seconds for the external command
>     specified by replication_delay_source_cmd. If the command does not
>     finish within the timeout, Pgpool-II logs an error and continues
>     using the built-in method.
> 
> It seems this ("continues using the built-in method") is different
> from the actual behavior. It seems that after timeout, the external
> command is tried and timeout.... Do you want to fix the source code to
> match with the document? Or change (just remove ""continues using the
> built-in method") the document? I am fine with changing the document.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp


Attachments:

  [application/octet-stream] stream-check.sgml (28.4K, 2-stream-check.sgml)
  download

^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2026-01-06 03:53  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2026-01-06 03:53 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Thank you so much for the help with this! And for the suggestions.

Let’s change the document to match then implementation in this case.
Should I share an updated patch or will you modify it on merge?

Nadav Shatz
Tailor Brands | CTO


On Tue, Jan 6, 2026 at 1:59 AM Tatsuo Ishii <[email protected]> wrote:

> And... attached is the Japanese document. For those who are interested.
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> >> Thanks for the help! please find attached the latest version with all
> >> changes and test passing.
> >
> > Thanks for updating patch! I confirmed that all test have passed on my
> > Ubunu box.  Now I am working on Japanese document. While working on
> > it, I did followings to the English document. (see attached)
> >
> > - Reformat it so that each line is not too long. Like PostgreSQL, I
> >   bend each line at most 78 chars. (I know other parts of document do
> >   not follow the rule but I do not want to add more lines not
> >   following the rule).
> >
> > - Move replication_delay_source_cmd (string) and
> >   replication_delay_source_timeout (integer) at the bottom of
> >   "5.12. Streaming Replication Check" section. We usually add the new
> >   parameters at the bottom of the page if there's no particular
> >   reason. Previously they were in between prefer_lower_delay_standby
> >   and log_standby_delay.
> >
> > - Add following to replication_delay_source_cmd.  "The line can be
> >   terminated with or without a new line character." This is observed
> >   from the implementation. I believe this matters for those who try to
> >   implement replication_delay_source_cmd.
> >
> > Lastly I have one question.
> >
> > replication_delay_source_timeout (integer)
> >
> >     Specifies the timeout in seconds for the external command
> >     specified by replication_delay_source_cmd. If the command does not
> >     finish within the timeout, Pgpool-II logs an error and continues
> >     using the built-in method.
> >
> > It seems this ("continues using the built-in method") is different
> > from the actual behavior. It seems that after timeout, the external
> > command is tried and timeout.... Do you want to fix the source code to
> > match with the document? Or change (just remove ""continues using the
> > built-in method") the document? I am fine with changing the document.
> >
> > Best regards,
> > --
> > Tatsuo Ishii
> > SRA OSS K.K.
> > English: http://www.sraoss.co.jp/index_en/
> > Japanese:http://www.sraoss.co.jp
>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2026-01-06 04:48  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2026-01-06 04:48 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> Thank you so much for the help with this! And for the suggestions.
> 
> Let’s change the document to match then implementation in this case.
> Should I share an updated patch or will you modify it on merge?

I will modify it on merge. No swear.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Nadav Shatz
> Tailor Brands | CTO
> 
> 
> On Tue, Jan 6, 2026 at 1:59 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> And... attached is the Japanese document. For those who are interested.
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> >> Thanks for the help! please find attached the latest version with all
>> >> changes and test passing.
>> >
>> > Thanks for updating patch! I confirmed that all test have passed on my
>> > Ubunu box.  Now I am working on Japanese document. While working on
>> > it, I did followings to the English document. (see attached)
>> >
>> > - Reformat it so that each line is not too long. Like PostgreSQL, I
>> >   bend each line at most 78 chars. (I know other parts of document do
>> >   not follow the rule but I do not want to add more lines not
>> >   following the rule).
>> >
>> > - Move replication_delay_source_cmd (string) and
>> >   replication_delay_source_timeout (integer) at the bottom of
>> >   "5.12. Streaming Replication Check" section. We usually add the new
>> >   parameters at the bottom of the page if there's no particular
>> >   reason. Previously they were in between prefer_lower_delay_standby
>> >   and log_standby_delay.
>> >
>> > - Add following to replication_delay_source_cmd.  "The line can be
>> >   terminated with or without a new line character." This is observed
>> >   from the implementation. I believe this matters for those who try to
>> >   implement replication_delay_source_cmd.
>> >
>> > Lastly I have one question.
>> >
>> > replication_delay_source_timeout (integer)
>> >
>> >     Specifies the timeout in seconds for the external command
>> >     specified by replication_delay_source_cmd. If the command does not
>> >     finish within the timeout, Pgpool-II logs an error and continues
>> >     using the built-in method.
>> >
>> > It seems this ("continues using the built-in method") is different
>> > from the actual behavior. It seems that after timeout, the external
>> > command is tried and timeout.... Do you want to fix the source code to
>> > match with the document? Or change (just remove ""continues using the
>> > built-in method") the document? I am fine with changing the document.
>> >
>> > Best regards,
>> > --
>> > Tatsuo Ishii
>> > SRA OSS K.K.
>> > English: http://www.sraoss.co.jp/index_en/
>> > Japanese:http://www.sraoss.co.jp
>>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2026-01-06 06:13  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2026-01-06 06:13 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Thanks a lot! Looking for the merge.
I’ll start working on the suggestion for the next stage. Hopefully it’ll be
quicker now that I have the dev env better setup

Nadav Shatz
Tailor Brands | CTO


On Tue, Jan 6, 2026 at 6:48 AM Tatsuo Ishii <[email protected]> wrote:

> > Thank you so much for the help with this! And for the suggestions.
> >
> > Let’s change the document to match then implementation in this case.
> > Should I share an updated patch or will you modify it on merge?
>
> I will modify it on merge. No swear.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Nadav Shatz
> > Tailor Brands | CTO
> >
> >
> > On Tue, Jan 6, 2026 at 1:59 AM Tatsuo Ishii <[email protected]>
> wrote:
> >
> >> And... attached is the Japanese document. For those who are interested.
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
> >> >> Thanks for the help! please find attached the latest version with all
> >> >> changes and test passing.
> >> >
> >> > Thanks for updating patch! I confirmed that all test have passed on my
> >> > Ubunu box.  Now I am working on Japanese document. While working on
> >> > it, I did followings to the English document. (see attached)
> >> >
> >> > - Reformat it so that each line is not too long. Like PostgreSQL, I
> >> >   bend each line at most 78 chars. (I know other parts of document do
> >> >   not follow the rule but I do not want to add more lines not
> >> >   following the rule).
> >> >
> >> > - Move replication_delay_source_cmd (string) and
> >> >   replication_delay_source_timeout (integer) at the bottom of
> >> >   "5.12. Streaming Replication Check" section. We usually add the new
> >> >   parameters at the bottom of the page if there's no particular
> >> >   reason. Previously they were in between prefer_lower_delay_standby
> >> >   and log_standby_delay.
> >> >
> >> > - Add following to replication_delay_source_cmd.  "The line can be
> >> >   terminated with or without a new line character." This is observed
> >> >   from the implementation. I believe this matters for those who try to
> >> >   implement replication_delay_source_cmd.
> >> >
> >> > Lastly I have one question.
> >> >
> >> > replication_delay_source_timeout (integer)
> >> >
> >> >     Specifies the timeout in seconds for the external command
> >> >     specified by replication_delay_source_cmd. If the command does not
> >> >     finish within the timeout, Pgpool-II logs an error and continues
> >> >     using the built-in method.
> >> >
> >> > It seems this ("continues using the built-in method") is different
> >> > from the actual behavior. It seems that after timeout, the external
> >> > command is tried and timeout.... Do you want to fix the source code to
> >> > match with the document? Or change (just remove ""continues using the
> >> > built-in method") the document? I am fine with changing the document.
> >> >
> >> > Best regards,
> >> > --
> >> > Tatsuo Ishii
> >> > SRA OSS K.K.
> >> > English: http://www.sraoss.co.jp/index_en/
> >> > Japanese:http://www.sraoss.co.jp
> >>
>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2026-01-06 06:43  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2026-01-06 06:43 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

> Thanks a lot! Looking for the merge.

I have just pushed the patch. Thank you for the great work!

> I’ll start working on the suggestion for the next stage. Hopefully it’ll be
> quicker now that I have the dev env better setup
>
> Nadav Shatz
> Tailor Brands | CTO

Looking foward to seeing new patch!

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> On Tue, Jan 6, 2026 at 6:48 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> > Thank you so much for the help with this! And for the suggestions.
>> >
>> > Let’s change the document to match then implementation in this case.
>> > Should I share an updated patch or will you modify it on merge?
>>
>> I will modify it on merge. No swear.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> > Nadav Shatz
>> > Tailor Brands | CTO
>> >
>> >
>> > On Tue, Jan 6, 2026 at 1:59 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> And... attached is the Japanese document. For those who are interested.
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >> >> Thanks for the help! please find attached the latest version with all
>> >> >> changes and test passing.
>> >> >
>> >> > Thanks for updating patch! I confirmed that all test have passed on my
>> >> > Ubunu box.  Now I am working on Japanese document. While working on
>> >> > it, I did followings to the English document. (see attached)
>> >> >
>> >> > - Reformat it so that each line is not too long. Like PostgreSQL, I
>> >> >   bend each line at most 78 chars. (I know other parts of document do
>> >> >   not follow the rule but I do not want to add more lines not
>> >> >   following the rule).
>> >> >
>> >> > - Move replication_delay_source_cmd (string) and
>> >> >   replication_delay_source_timeout (integer) at the bottom of
>> >> >   "5.12. Streaming Replication Check" section. We usually add the new
>> >> >   parameters at the bottom of the page if there's no particular
>> >> >   reason. Previously they were in between prefer_lower_delay_standby
>> >> >   and log_standby_delay.
>> >> >
>> >> > - Add following to replication_delay_source_cmd.  "The line can be
>> >> >   terminated with or without a new line character." This is observed
>> >> >   from the implementation. I believe this matters for those who try to
>> >> >   implement replication_delay_source_cmd.
>> >> >
>> >> > Lastly I have one question.
>> >> >
>> >> > replication_delay_source_timeout (integer)
>> >> >
>> >> >     Specifies the timeout in seconds for the external command
>> >> >     specified by replication_delay_source_cmd. If the command does not
>> >> >     finish within the timeout, Pgpool-II logs an error and continues
>> >> >     using the built-in method.
>> >> >
>> >> > It seems this ("continues using the built-in method") is different
>> >> > from the actual behavior. It seems that after timeout, the external
>> >> > command is tried and timeout.... Do you want to fix the source code to
>> >> > match with the document? Or change (just remove ""continues using the
>> >> > built-in method") the document? I am fine with changing the document.
>> >> >
>> >> > Best regards,
>> >> > --
>> >> > Tatsuo Ishii
>> >> > SRA OSS K.K.
>> >> > English: http://www.sraoss.co.jp/index_en/
>> >> > Japanese:http://www.sraoss.co.jp
>> >>
>>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2026-01-08 02:19  Tatsuo Ishii <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Tatsuo Ishii @ 2026-01-08 02:19 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Nadav,

I have just noticed that check_replication_time_lag_with_cmd() can be
simplified by using StringInfo functions. Patch attached. What do you
think?

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


Attachments:

  [application/octet-stream] refactor_check_replication_time_lag_with_cmd.patch (2.3K, 2-refactor_check_replication_time_lag_with_cmd.patch)
  download | inline diff:
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index ee4796437..311b63865 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -692,7 +692,6 @@ check_replication_time_lag_with_cmd(void)
 	char	   *temp_token;
 	char	   *endptr;
 	char	   *ident;
-	const char *base_command;
 	double		delay_ms;
 	uint64		delay;
 	uint64		delay_threshold_by_time;
@@ -700,8 +699,6 @@ check_replication_time_lag_with_cmd(void)
 	int			primary_node_id;
 	int			save_errno;
 	int			i;
-	size_t		total_len;
-	size_t		current_len;
 	BackendInfo *bkinfo;
 	ErrorContextCallback callback;
 	int			pipefd[2] = {-1, -1};
@@ -712,6 +709,7 @@ check_replication_time_lag_with_cmd(void)
 	ssize_t		bytes_read;
 	int			status;
 	int			num_replicas;
+	StringInfoData strbuf;
 
 	if (NUM_BACKENDS <= 1)
 	{
@@ -760,46 +758,19 @@ check_replication_time_lag_with_cmd(void)
 	/* Execute command as current process user */
 	PG_TRY();
 	{
-		base_command = pool_config->replication_delay_source_cmd;
-		total_len = strlen(base_command) + 1;	/* +1 for NUL */
-
+		initStringInfo(&strbuf);
+		appendStringInfoString(&strbuf,
+							   pool_config->replication_delay_source_cmd);
 		/* Build command with replica-only arguments (omit primary) */
-
-		/*
-		 * Calculate total command length including space-separated replica
-		 * identifiers
-		 */
-		for (i = 0; i < NUM_BACKENDS; i++)
-		{
-			if (i == primary_node_id)
-				continue;		/* Skip primary node */
-
-			ident = build_instance_identifier_for_node(i);
-
-			total_len += 1 /* space */ + strlen(ident);
-			pfree(ident);
-		}
-
-		command = palloc(total_len);
-		strlcpy(command, base_command, total_len);
-
-		/* Append replica identifiers */
-		current_len = strlen(command);
-
 		for (i = 0; i < NUM_BACKENDS; i++)
 		{
 			if (i == primary_node_id)
 				continue;		/* Skip primary node */
-
 			ident = build_instance_identifier_for_node(i);
-
-			/* Append space and identifier */
-			snprintf(command + current_len, total_len - current_len, " %s", ident);
-			current_len += strlen(command + current_len);
-
+			appendStringInfo(&strbuf, " %s", ident);
 			pfree(ident);
 		}
-
+		command = strbuf.data;
 		ereport(DEBUG1,
 				(errmsg("executing replication delay command: %s", command)));
 


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2026-01-08 04:59  Nadav Shatz <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 61+ messages in thread

From: Nadav Shatz @ 2026-01-08 04:59 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

Much much better and cleaner! Thanks!

Nadav Shatz
Tailor Brands | CTO


On Thu, Jan 8, 2026 at 4:19 AM Tatsuo Ishii <[email protected]> wrote:

> Hi Nadav,
>
> I have just noticed that check_replication_time_lag_with_cmd() can be
> simplified by using StringInfo functions. Patch attached. What do you
> think?
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread

* Re: Proposal: recent access based routing for primary-replica setups
@ 2026-01-08 05:44  Tatsuo Ishii <[email protected]>
  parent: Nadav Shatz <[email protected]>
  0 siblings, 0 replies; 61+ messages in thread

From: Tatsuo Ishii @ 2026-01-08 05:44 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Than you for the review!
I will commit the patch.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Much much better and cleaner! Thanks!
> 
> Nadav Shatz
> Tailor Brands | CTO
> 
> 
> On Thu, Jan 8, 2026 at 4:19 AM Tatsuo Ishii <[email protected]> wrote:
> 
>> Hi Nadav,
>>
>> I have just noticed that check_replication_time_lag_with_cmd() can be
>> simplified by using StringInfo functions. Patch attached. What do you
>> think?
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>


^ permalink  raw  reply  [nested|flat] 61+ messages in thread


end of thread, other threads:[~2026-01-08 05:44 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-08-17 13:27 Proposal: recent access based routing for primary-replica setups Nadav Shatz <[email protected]>
2025-08-18 12:51 ` Tatsuo Ishii <[email protected]>
2025-08-18 14:11   ` Nadav Shatz <[email protected]>
2025-08-20 12:45     ` Tatsuo Ishii <[email protected]>
2025-08-20 14:27       ` Nadav Shatz <[email protected]>
2025-08-21 05:04         ` Tatsuo Ishii <[email protected]>
2025-08-21 07:38           ` Nadav Shatz <[email protected]>
2025-08-21 10:23             ` Tatsuo Ishii <[email protected]>
2025-08-24 11:11               ` Nadav Shatz <[email protected]>
2025-08-25 02:18                 ` Tatsuo Ishii <[email protected]>
2025-08-25 12:50                   ` Nadav Shatz <[email protected]>
2025-08-26 01:41                     ` Tatsuo Ishii <[email protected]>
2025-08-26 06:54                       ` Nadav Shatz <[email protected]>
2025-09-01 13:34                         ` Nadav Shatz <[email protected]>
2025-09-01 22:41                           ` Tatsuo Ishii <[email protected]>
2025-09-02 08:32                             ` Nadav Shatz <[email protected]>
2025-09-03 23:36                               ` Tatsuo Ishii <[email protected]>
2025-09-07 08:52                                 ` Nadav Shatz <[email protected]>
2025-09-08 00:26                                   ` Tatsuo Ishii <[email protected]>
2025-09-08 09:50                                     ` Nadav Shatz <[email protected]>
2025-09-08 12:02                                       ` Tatsuo Ishii <[email protected]>
2025-09-09 00:39                                       ` Tatsuo Ishii <[email protected]>
2025-09-15 12:48                                         ` Nadav Shatz <[email protected]>
2025-09-16 10:30                                           ` Tatsuo Ishii <[email protected]>
2025-09-20 23:57                                             ` Nadav Shatz <[email protected]>
2025-09-21 22:34                                               ` Tatsuo Ishii <[email protected]>
2025-09-29 12:24                                                 ` Nadav Shatz <[email protected]>
2025-09-30 09:35                                                   ` Tatsuo Ishii <[email protected]>
2025-10-29 10:43                                                     ` Nadav Shatz <[email protected]>
2025-10-30 23:45                                                       ` Tatsuo Ishii <[email protected]>
2025-11-01 06:36                                                         ` Tatsuo Ishii <[email protected]>
2025-11-02 11:23                                                           ` Nadav Shatz <[email protected]>
2025-11-03 07:05                                                             ` Tatsuo Ishii <[email protected]>
2025-11-05 10:37                                                               ` Nadav Shatz <[email protected]>
2025-11-06 09:24                                                                 ` Tatsuo Ishii <[email protected]>
2025-11-18 11:37                                                                   ` Nadav Shatz <[email protected]>
2025-11-19 23:09                                                                     ` Tatsuo Ishii <[email protected]>
2025-11-23 09:53                                                                       ` Nadav Shatz <[email protected]>
2025-11-24 07:41                                                                         ` Tatsuo Ishii <[email protected]>
2025-12-21 11:06                                                                           ` Nadav Shatz <[email protected]>
2025-12-23 00:13                                                                             ` Tatsuo Ishii <[email protected]>
2025-12-23 06:28                                                                               ` Nadav Shatz <[email protected]>
2025-12-23 08:46                                                                                 ` Tatsuo Ishii <[email protected]>
2025-12-23 14:03                                                                                   ` Nadav Shatz <[email protected]>
2025-12-24 02:46                                                                                     ` Tatsuo Ishii <[email protected]>
2025-12-26 07:15                                                                                     ` Tatsuo Ishii <[email protected]>
2025-12-26 07:54                                                                                     ` Tatsuo Ishii <[email protected]>
2025-12-26 10:03                                                                                       ` Tatsuo Ishii <[email protected]>
2025-12-28 12:21                                                                                         ` Nadav Shatz <[email protected]>
2025-12-28 23:48                                                                                           ` Tatsuo Ishii <[email protected]>
2025-12-28 23:58                                                                                             ` Tatsuo Ishii <[email protected]>
2025-12-29 09:31                                                                                               ` Nadav Shatz <[email protected]>
2026-01-05 23:52                                                                                                 ` Tatsuo Ishii <[email protected]>
2026-01-05 23:59                                                                                                   ` Tatsuo Ishii <[email protected]>
2026-01-06 03:53                                                                                                     ` Nadav Shatz <[email protected]>
2026-01-06 04:48                                                                                                       ` Tatsuo Ishii <[email protected]>
2026-01-06 06:13                                                                                                         ` Nadav Shatz <[email protected]>
2026-01-06 06:43                                                                                                           ` Tatsuo Ishii <[email protected]>
2026-01-08 02:19                                                                                                             ` Tatsuo Ishii <[email protected]>
2026-01-08 04:59                                                                                                               ` Nadav Shatz <[email protected]>
2026-01-08 05:44                                                                                                                 ` Tatsuo Ishii <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox