MIME-Version: 1.0
References: 
 <CACeKOO2E6cuCOQGFzq8i0+pFwi=JG4deiapHGkShjMjbn_-6tw@mail.gmail.com>
 <20250818.215106.1325564662459771705.ishii@postgresql.org>
In-Reply-To: <20250818.215106.1325564662459771705.ishii@postgresql.org>
From: Nadav Shatz <nadav@tailorbrands.com>
Date: Mon, 18 Aug 2025 17:11:42 +0300
Message-ID: 
 <CACeKOO1-qront3LzcxOwjJBJz_jGYE9av4SrBa90SpTydPvY=Q@mail.gmail.com>
Subject: Re: Proposal: recent access based routing for primary-replica setups
To: Tatsuo Ishii <ishii@postgresql.org>
Cc: pgpool-hackers@lists.postgresql.org
Content-Type: multipart/alternative; boundary="0000000000005df85f063ca45274"
Archived-At: 
 <https://www.postgresql.org/message-id/CACeKOO1-qront3LzcxOwjJBJz_jGYE9av4SrBa90SpTydPvY%3DQ%40mail.gmail.com>
Precedence: bulk

--0000000000005df85f063ca45274
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Tatsuo,

Thank you very much for your reply and questions!
I'll try to respond to everything inline. Please let me know if I missed
something or if anything isn't clear enough.

On Mon, Aug 18, 2025 at 3:51 PM Tatsuo Ishii <ishii@postgresql.org> wrote:

> Hello Nadav,
>
> Thank you for the proposal. I have a few questions.
>
> > Hello all,
> >
> > My name is Nadav Shatz, I’m the CTO at Tailor Brands and have been working
> > with PostgreSQL in high-traffic, distributed environments for many years.
> > Most of my focus has been on backend architecture, scaling, and performance
> > optimization, and I’m a long-time user and admirer of the Postgres
> > ecosystem.
> >
> > I’d like to propose adding a feature to pgpool-II for *recent access based
> > routing* in primary-replica setups. The idea is similar to what we’ve
> > described in this article
> > <https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a>,
> > and is also reflected in this pgcat PR
> > <https://github.com/postgresml/pgcat/pull/864>. The core concept is to
> > route read queries to the primary if they occur shortly after a write,
> > reducing replica lag inconsistencies while still benefiting from read
> > scaling.
> >
> > *How it would work (high-level):*
> >
> >    - *External “effective lag” via config (hot-reloaded):* Instead of
> >      relying on pgpool-II’s replication delay checks (which don’t map well
> >      to Aurora semantics), we’ll expose a *config value* representing the
> >      effective replica lag (or directly the TTL to use for “recency”).
> >      This value is *pushed by an external controller* and *hot-reloaded*
> >      (no restarts). The relevant knobs might look like:
> >
> >       - enable_recent_access_routing (boolean, default off)
> >       - recent_access_ttl_ms (integer, default 0, can be hot-reloaded)
>
> If my understanding is correct, the "external controller" updates
> "recent_access_ttl_ms" to let pgpool know the current delay of
> replica. My question is, what if there are multiple replicas. In this
> case the "external controller" calculates the average latency of each
> replica?
>
> Another question is, how often the external controller updates and
> reload pgpool.conf. If it's like every second, probably it could give
> unacceptable load to pgpool because reloading pgpool.conf is expensive
> operation.
>
>
You understood correctly. My plan was to keep it as generic as possible
and leave all the logic to the external controller. All of these decisions
(how often to update, how to calculate the value, etc.) would be left to
the external implementation, since they can get very case-specific.
This approach comes from the need to understand replica lag under AWS
Aurora, which doesn't expose these metrics from the DB itself.
I also thought of a couple of other possible mechanisms:
1. Use a pcp command, as you suggest below (I wasn't aware of this
option). This addresses the expensive reload, but not the other concerns
mentioned.
2. Implement support for reading the lag directly from the AWS Aurora API.
While this is specific to one cloud provider and DB "flavor", it is a very
large and common use case. Doing this would open up all the other pgpool
features that rely on lag values being available. From a performance
perspective it is probably best.
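
To make the multi-replica question a bit more concrete, here is a rough
sketch of what an external controller could do. It is Python purely for
illustration, and every name in it (effective_ttl_ms, should_push, the
margin and threshold values) is hypothetical, not an existing pgpool-II
API. It assumes the controller already has a per-replica lag figure from
somewhere. It takes the worst replica rather than the average, since a
read may be balanced to any of them, and it only pushes a new value when
it has moved enough to matter, which keeps the update frequency low.

```python
from typing import Mapping, Optional


def effective_ttl_ms(replica_lag_ms: Mapping[str, float],
                     safety_margin_ms: int = 50,
                     max_ttl_ms: int = 5000) -> int:
    """Derive a single TTL from per-replica lag (hypothetical policy).

    Uses the slowest replica plus a margin, clamped to a global maximum.
    """
    if not replica_lag_ms:
        return 0  # no replicas reported, so nothing to protect against
    worst_lag = max(replica_lag_ms.values())
    return min(max_ttl_ms, int(worst_lag) + safety_margin_ms)


def should_push(previous_ttl_ms: Optional[int],
                new_ttl_ms: int,
                min_change_ms: int = 100) -> bool:
    """Push only when the value moved by at least min_change_ms."""
    if previous_ttl_ms is None:
        return True  # first value always goes out
    return abs(new_ttl_ms - previous_ttl_ms) >= min_change_ms
```

How the value is then delivered (config reload, pcp command, or something
else) is a separate question from how it is computed.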

> >       - enable_query_parser (boolean, required for this feature, default off)
>
> What does this do? Why do you need this?
>

This was referring to enabling the automatic routing that already exists
in pgpool (based on query content); the naming was wrong.
I basically meant to say that if automatic routing is disabled, there is
no point in enabling recent-access-based routing.
Sorry for the confusion.


>
> > *In-memory recent-access map:* Each worker maintains a lightweight per-DB
> >    in-memory map of *recently written relations*. On any write
>
> Is "per-DB in-memory map" in shared memory?
>

Yes


>
> >    (INSERT/UPDATE/DELETE/UPSERT/TRUNCATE), we record the touched relations
> >    with a TTL derived from recent_access_ttl_ms. Entries expire
> >    automatically; writes refresh them.
>
> How do you automatically expire the entries? Are you going to
> implement something like a auto sweeper process?
>

Great question. Maybe a sweeper like that, combined with lazy deletion on
read, similar to what memcached does.
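
Roughly what I have in mind, as an illustrative Python sketch only. The
real thing would live in pgpool-II's shared memory in C with proper
locking, and the class and method names here are made up for the example.
Expired entries are dropped lazily when a read looks them up, and a
periodic sweep cleans up entries that are never read again.

```python
import time
from typing import Callable, Dict


class RecentWriteMap:
    """Relation name -> expiry time, with lazy and periodic expiration."""

    def __init__(self, ttl_ms: int,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_ms = ttl_ms
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def record_write(self, relation: str) -> None:
        # A new write refreshes the entry's expiry.
        self._expires_at[relation] = self._clock() + self.ttl_ms / 1000.0

    def is_recent(self, relation: str) -> bool:
        expires = self._expires_at.get(relation)
        if expires is None:
            return False
        if self._clock() >= expires:
            # Lazy deletion: drop the stale entry on the read path.
            del self._expires_at[relation]
            return False
        return True

    def sweep(self) -> int:
        """Remove all expired entries; returns how many were removed."""
        now = self._clock()
        stale = [rel for rel, exp in self._expires_at.items() if now >= exp]
        for rel in stale:
            del self._expires_at[rel]
        return len(stale)
```

The sweep could run from an existing periodic process rather than a
dedicated one; that part is open.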


>
> > *Routing + query parsing:* For incoming statements we parse enough to
> >    answer two questions: (1) is it a read or a write? and (2) which
> >    relations are referenced? If a read touches any “recently written”
> >    relation, we *force route to primary*; otherwise we allow normal read
> >    load-balancing to replicas.
>
> Pgpool-II already does (1) and (2).
>

1 - Of course, I'm trying to build on top of it.
2 - Maybe I'm not understanding the existing documentation correctly, but
I couldn't find anything that takes the specific relations (tables) into
consideration, only the query type (read/write) or the replication delay
exceeding delay_threshold.
Our approach here basically accepts no delay for these specific relations,
so you get guaranteed data freshness at the expense of checking the
specific table. It's a different kind of tradeoff.
The whole approach can be expanded to take further "generic values" into
consideration if needed, for instance a "tenant" ID. Though for those
cases, using a table per tenant already solves that.
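
To illustrate the per-relation decision, here is a minimal Python sketch.
It assumes the existing parser has already told us whether the statement
is a write and which relations it references; choose_backend and its
arguments are hypothetical names for the example, and "replica" here just
means "eligible for normal load balancing".

```python
from typing import Iterable, Mapping


def choose_backend(is_write: bool,
                   relations: Iterable[str],
                   write_expiry: Mapping[str, float],
                   now: float) -> str:
    """Return "primary" or "replica" for one statement.

    write_expiry maps a relation name to the time until which it still
    counts as recently written.
    """
    if is_write:
        return "primary"
    for rel in relations:
        expires = write_expiry.get(rel)
        if expires is not None and now < expires:
            # The read touches a recently written table: force primary.
            return "primary"
    return "replica"
```

So a read joining "orders" and "users" right after a write to "orders"
goes to the primary, while a read on "users" alone is still balanced.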

Please let me know what I might be missing here.


>
> > *Notes on behavior & ops:*
> >
> >    - *Config & hot reload:* Operators (or an external controller) can
> >      update recent_access_ttl_ms dynamically and trigger hot reload to
> >      adapt to changing conditions - no reliance on Aurora internals.
> >    - *Safety levers:* a global max TTL, optional allow/deny lists, and
> >      metrics (e.g., “reads forced to primary due to recency”) for
> >      visibility.
>
> Please elaborate more on this. Allow/deny what?
>

We could add a "table list" of tables for which the feature is ignored (a
deny list) or, in reverse, an allow list that enables it only for specific
tables. I don't think that's needed, especially not for V1.


> >    *Defaults & compatibility:* all defaults are safe/off; enabling
> >    requires explicit opt-in.
>
> Sounds good.
>
> > I’ll prepare the code changes and send a patch/PR, but before diving in I
> > wanted to check if anyone has *objections, concerns, or preferred
> > alternatives* - particularly around parser hooks, shared memory use, or
> > hot-reload mechanics in pgpool-II.
>
> Probably you should consider adding a pcp command to notice pgpool the
> "recent_access_ttl_ms". That is far more efficient than reloading
> pgpool.conf.


Great idea! I wasn't aware of that mechanism, to be honest.

Lastly, another note that came up: if we ever have to evict old items from
the map, we can disable the feature together with load balancing, or make
the behavior in such a scenario configurable.
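
One possible variant of that fallback, sketched in Python for
illustration only. This is my assumption of how it could behave, not a
settled design, and all names are hypothetical. The map has a fixed
capacity; if a write forces out an entry that has not expired yet, we can
no longer vouch for that table, so all reads go to the primary until the
evicted entry would have expired anyway.

```python
from collections import OrderedDict


class BoundedRecentWrites:
    """Fixed-capacity recent-write map with a safe fallback on eviction."""

    def __init__(self, capacity: int, ttl_s: float):
        self.capacity = capacity
        self.ttl_s = ttl_s
        # relation -> expiry time, kept in order of last write
        self._expires_at = OrderedDict()
        # Until this time, every read is sent to the primary.
        self.degraded_until = 0.0

    def record_write(self, relation: str, now: float) -> None:
        self._expires_at.pop(relation, None)  # re-insert at the end
        if len(self._expires_at) >= self.capacity:
            _, evicted_expiry = self._expires_at.popitem(last=False)
            if evicted_expiry > now:
                # A still-recent write was forgotten: stop load balancing
                # until it would have expired.
                self.degraded_until = max(self.degraded_until,
                                          evicted_expiry)
        self._expires_at[relation] = now + self.ttl_s

    def must_use_primary(self, relation: str, now: float) -> bool:
        if now < self.degraded_until:
            return True
        return self._expires_at.get(relation, 0.0) > now
```

A stricter setting could disable load balancing entirely until an
operator intervenes; which of the two is the default is something I'd
like feedback on.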


>


> > Thanks for considering,
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>

Best regards,
-- 
Nadav Shatz
Tailor Brands | CTO


--0000000000005df85f063ca45274--