Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uo0ad-00Edyt-M8 for pgpool-hackers@arkaria.postgresql.org; Mon, 18 Aug 2025 14:12:01 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uo0ad-00H2dK-4X for pgpool-hackers@arkaria.postgresql.org; Mon, 18 Aug 2025 14:11:59 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uo0ac-00H2dC-Rw for pgpool-hackers@lists.postgresql.org; Mon, 18 Aug 2025 14:11:59 +0000 Received: from mail-yw1-x1130.google.com ([2607:f8b0:4864:20::1130]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1uo0aa-000WqG-1o for pgpool-hackers@lists.postgresql.org; Mon, 18 Aug 2025 14:11:58 +0000 Received: by mail-yw1-x1130.google.com with SMTP id 00721157ae682-71d603b62adso35126677b3.1 for ; Mon, 18 Aug 2025 07:11:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=tailorbrands.com; s=google; t=1755526315; x=1756131115; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=ToPMWJzf1v0VmpcO9X8qLKBz6TzuQwzrt3oJPcjdrys=; b=TOZV0jF6rtFzOOIEX4MPKxSoa+ZEuiusgb/SUMPkj6oYaLW4WSNFzAYl/Ij92dt5+M /zWGXR/QtdULPDDO0+3HOpPWhCek6gXRe44PIlq4MOB6+5IR+qZkjP2r7ntvpX6C/X4G AWl14CZdT5s5AtIBQ/eAo1KCjD8/v93GAYZ90XfcYKiVm71DGrsVQZoBTUygyIKXPK4u PDjmbHiBlc/kmB+V95PDh7/I/k9GW9lK+2nJOTaH13YOlDZY2kD5JiIwPM7gw0gJsFnp ByTH8KlzdvQWGoYIBcIT5racXd5hvvI/WwSkT6II1elcwCMaNiisZIufgnwaRlkduLxP i6Kg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1755526315; x=1756131115; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=ToPMWJzf1v0VmpcO9X8qLKBz6TzuQwzrt3oJPcjdrys=; b=cWLVqZ/ZK/pL3dMY7S9yOUnmJ666xfInUp1qGUaESBQPTV+1Iv+2dNuZTWyFEBKDNu rcYH3PVKFWQE9BnyLfsYtbpw4u6RdY67Lk5zFZ2AbyMi4dQIv+4ke9WmH9gRY9abjvpq kRagj7ukxR+5UaAjwyK5ezhjLxkcyMhFAvlr9IbZGe0uJ9QYl8ojDCgwmTzPV+Dm08Yp waKjR2kEW6re7CRIobNvlHJhEKxNTOjJ4UydWO2FhZWeS7nNOVGFTF5bXGveVeSu8mKg PRoscVFfZK3ZDseyzEbQmRdRRlusJ6X/VSMfHNGgy8EEjo0bKKxQyZqOQfCc0Aneuc73 cLnw== X-Gm-Message-State: AOJu0YydH7Smffk11JwOu9tQ/kkw1x8HOCq1lGrK3kkVrva4tI1NVqZ8 DEbQrH1oJk6B7FKO98BWjagOq0Ruq+3410nfTJJPiNS485Xc7G3WPet4+uYF5bj/AjhK0kNIMQ1 gsVs6c4Md9BUFG9dCc2Ep0o0y3DZGVN4jGGUGJCvv+54OFKo145weWzmNmBdN X-Gm-Gg: ASbGnctmRHhT9lVYcDoxBp11rTsJ3fqXy5YBt+ZnBeOaTO5ckT035HOeHC3ZBgI+06A l6i61yXtkIw6fzOP3VDuG1reUlx9wQGYVj/gwl3/ZosYkYzUXb1iX32bRkoqe2JXsoWhEt05rJl 0YNfDExJY0LqvDsObYW/EU/8m0tL+3u/3SCItvtjLhI41e+hkUAM3VX/6c894ed1dPqzu/7MNsD dQnKig+caQr2ofn X-Google-Smtp-Source: AGHT+IH7OxfjL7Hcnamskti7nu86OuNTfL6T5qDns+nW4Fgwa06BkAcYE+v/c594gulNfcXT+Tc7SVM0yNEjNrvLsx0= X-Received: by 2002:a05:690c:4984:b0:71a:17fa:bf07 with SMTP id 00721157ae682-71e7751ebabmr109776077b3.40.1755526314445; Mon, 18 Aug 2025 07:11:54 -0700 (PDT) MIME-Version: 1.0 References: <20250818.215106.1325564662459771705.ishii@postgresql.org> In-Reply-To: <20250818.215106.1325564662459771705.ishii@postgresql.org> From: Nadav Shatz Date: Mon, 18 Aug 2025 17:11:42 +0300 X-Gm-Features: Ac12FXzNtgq4ngERfb219FU-NUHH1KtPBWmBi1-maKV-qlhHL82IE2iF5siNrD0 Message-ID: Subject: Re: Proposal: recent access based routing for primary-replica setups To: Tatsuo Ishii Cc: pgpool-hackers@lists.postgresql.org Content-Type: multipart/alternative; boundary="0000000000005df85f063ca45274" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000005df85f063ca45274 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Tatsuo, Thank you very much for your reply and questions! I'll try and respond to everything inline, please let me know if I missed something or if anything isn't clear enough. On Mon, Aug 18, 2025 at 3:51=E2=80=AFPM Tatsuo Ishii = wrote: > Hello Nadav, > > Thank you for the proposal. I have a few questions. > > > Hello all, > > > > My name is Nadav Shatz, I=E2=80=99m the CTO at Tailor Brands and have b= een > working > > with PostgreSQL in high-traffic, distributed environments for many year= s. > > Most of my focus has been on backend architecture, scaling, and > performance > > optimization, and I=E2=80=99m a long-time user and admirer of the Postg= res > > ecosystem. > > > > I=E2=80=99d like to propose adding a feature to pgpool-II for *recent a= ccess > based > > routing* in primary-replica setups. The idea is similar to what we=E2= =80=99ve > > described in this article > > < > https://medium.com/tailor-tech/using-database-read-replicas-in-distribute= d-systems-d80eaf6bbf8a > >, > > and is also reflected in this pgcat PR > > . The core concept is to > > route read queries to the primary if they occur shortly after a write, > > reducing replica lag inconsistencies while still benefiting from read > > scaling. > > > > *How it would work (high-level):* > > > > > > - > > > > > > *External =E2=80=9Ceffective lag=E2=80=9D via config (hot-reloaded): *I= nstead of relying > on > > pgpool-II=E2=80=99s replication delay checks (which don=E2=80=99t ma= p well to Aurora > > semantics), we=E2=80=99ll expose a *config value* representing the e= ffective > > replica lag (or directly the TTL to use for =E2=80=9Crecency=E2=80= =9D). This value > > is *pushed > > by an external controller* and *hot-reloaded* (no restarts). The > > relevant knobs might look like: > > > > - > > > > enable_recent_access_routing (boolean, default off) > > - > > > > recent_access_ttl_ms (integer, default 0, can be hot-reloaded) > > If my understanding is correct, the "external controller" updates > "recent_access_ttl_ms" to let pgpool know the current delay of > replica. My question is, what if there are multiple replicas. In this > case the "external controller" calculates the average latency of each > replica? > > Another question is, how often the external controller updates and > reload pgpool.conf. If it's like every second, probably it could give > unacceptable load to pgpool because reloading pgpool.conf is expensive > operation. > > You understood correctly - my plan was to keep it as generic as possible and leave all logic to be handled by the external controller. Basically leaving all of these decisions (how often to update, calculation, etc.) to the external implementation as it can get very case specific. This approach comes from the need of replica lag understanding under AWS Aurora - which doesn't expose these metrics from the DB itself. I also thought of implementing a couple of other possible mechanisms: 1. use a pcp command like you suggest below, i wasn't aware of the option, this will handle the expensive operation but no other concerns mentioned. 2. we can implement support to using the AWS Aurora API directly for the lag, while this is cloud provider and db "flavor" specific, it is a very large and common use case. Doing this will open up all other pgpool features that rely on the lag values being available. From a performance perspective it is probably best. > enable_query_parser (boolean, required for this feature, default > off) > > What does this do? Why do you need this? > this was referring to enabling the auto routing already existing in pgpool (based on query content), the naming is wrong. basically meant to say - if the auto routing is disabled, there is no point in enabling the latest access based routing. Sorry for the confusion. > > > *In-memory recent-access map: *Each worker maintains a lightweight per-= DB > > in-memory map of *recently written relations*. On any write > > Is "per-DB in-memory map" in shared memory? > Yes > > > (INSERT/UPDATE/DELETE/UPSERT/TRUNCATE), we record the touched > relations > > with a TTL derived from recent_access_ttl_ms. Entries expire > > automatically; writes refresh them. > > How do you automatically expire the entries? Are you going to > implement something like a auto sweeper process? > Great question - maybe combine that with a lazy deletion process on read. similar to what memcached is doing. > > > *Routing + query parsing: *For incoming statements we parse enough to > > answer two questions: (1) is it a read or a write? and (2) which > relations > > are referenced? If a read touches any =E2=80=9Crecently written=E2= =80=9D relation, we > *force > > route to primary*; otherwise we allow normal read load-balancing to > > replicas. > > Pgpool-II already does (1) and (2). > 1 - of course, i'm trying to build on top of it. 2 - maybe i'm not understanding the existing documentation correctly - but i couldn't find something that takes the specific relations (tables) under consideration, only query type (Read/Write) or passing the delay_threshold. Our approach here basically accepts no delay for these specific relations. so you get guaranteed data freshness at the expense of checking the specific table. it's a different kind of tradeoff. the whole approach can be expanded to take further "generic values" under considerations if needed to also take "tenant" id for instance under consideration. tho for those cases, using a table per tenant already solves that. Please let me know what i might be missing here. > > > *Notes on behavior & ops:* > > > > > > - > > > > *Config & hot reload:* Operators (or an external controller) can > update > > recent_access_ttl_ms dynamically and trigger hot reload to adapt to > > changing conditions=E2=80=95no reliance on Aurora internals. > > - > > > > *Safety levers:* a global max TTL, optional allow/deny lists, and > > metrics (e.g., =E2=80=9Creads forced to primary due to recency=E2=80= =9D) for > visibility. > > Please elaborate more on this. Allow/deny what? > We can add "table list" that would ignore the feature, or in reverse as an allow list that would enable it only for specific tables. I don't think that's needed, especially not for V1. > > *Defaults & compatibility:* all defaults are safe/off; enabling > requires > > explicit opt-in. > > Sounds good. > > > I=E2=80=99ll prepare the code changes and send a patch/PR, but before d= iving in I > > wanted to check if anyone has *objections, concerns, or preferred > > alternatives*=E2=80=95particularly around parser hooks, shared memory u= se, or > > hot-reload mechanics in pgpool-II. > > Probably you should consider adding a pcp command to notice pgpool the > "recent_access_ttl_ms". That is far more efficient than reloading > pgpool.conf. Great idea! i wasn't aware of the mechanism to be honest. lastly another note that came up - we can disable the feature and load balancing in case that we have to evict old items in the map. or have it configurable how to behave in such a scenario. > > > Thanks for considering, > > -- > > Nadav Shatz > > Tailor Brands | CTO > > Best regards, > -- > Tatsuo Ishii > SRA OSS K.K. > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp > Best regards, --=20 Nadav Shatz Tailor Brands | CTO --0000000000005df85f063ca45274 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Tatsuo,

Thank you v= ery much for your reply and questions!
I'll try and respond t= o everything inline, please let me know if I missed something or if anythin= g isn't clear enough.

On Mon, Aug 18, 2025 a= t 3:51=E2=80=AFPM Tatsuo Ishii <= ishii@postgresql.org> wrote:
Hello Nadav,

Thank you for the proposal. I have a few questions.

> Hello all,
>
> My name is Nadav Shatz, I=E2=80=99m the CTO at Tailor Brands and have = been working
> with PostgreSQL in high-traffic, distributed environments for many yea= rs.
> Most of my focus has been on backend architecture, scaling, and perfor= mance
> optimization, and I=E2=80=99m a long-time user and admirer of the Post= gres
> ecosystem.
>
> I=E2=80=99d like to propose adding a feature to pgpool-II for *recent = access based
> routing* in primary-replica setups. The idea is similar to what we=E2= =80=99ve
> described in this article
> <https://medium.com/tailor-tech/using-database-read-replicas-in-distribu= ted-systems-d80eaf6bbf8a>,
> and is also reflected in this pgcat PR
> <https://github.com/postgresml/pgcat/pull/864>. The core concept is to
> route read queries to the primary if they occur shortly after a write,=
> reducing replica lag inconsistencies while still benefiting from read<= br> > scaling.
>
> *How it would work (high-level):*
>
>
>=C2=A0 =C2=A0 -
>
>
> *External =E2=80=9Ceffective lag=E2=80=9D via config (hot-reloaded): *= Instead of relying on
>=C2=A0 =C2=A0 pgpool-II=E2=80=99s replication delay checks (which don= =E2=80=99t map well to Aurora
>=C2=A0 =C2=A0 semantics), we=E2=80=99ll expose a *config value* represe= nting the effective
>=C2=A0 =C2=A0 replica lag (or directly the TTL to use for =E2=80=9Crece= ncy=E2=80=9D). This value
> is *pushed
>=C2=A0 =C2=A0 by an external controller* and *hot-reloaded* (no restart= s). The
>=C2=A0 =C2=A0 relevant knobs might look like:
>
>=C2=A0 =C2=A0 -
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0enable_recent_access_routing (boolean, defau= lt off)
>=C2=A0 =C2=A0 =C2=A0 =C2=A0-
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0recent_access_ttl_ms (integer, default 0, ca= n be hot-reloaded)

If my understanding is correct, the "external controller" updates=
"recent_access_ttl_ms" to let pgpool know the current delay of replica. My question is, what if there are multiple replicas. In this
case the "external controller" calculates the average latency of = each
replica?

Another question is, how often the external controller updates and
reload pgpool.conf. If it's like every second, probably it could give unacceptable load to pgpool because reloading pgpool.conf is expensive
operation.



>=C2=A0 =C2=A0 =C2=A0 =C2=A0enable_query_parser (boolean, required for t= his feature, default off)

What does this do? Why do you need this?

this was referring to enabling the auto routing already existing in pgpoo= l (based on query content), the naming is wrong.=C2=A0
Sorry for the confusion.
=C2=A0

> *In-memory recent-access map: *Each worker maintains a lightweight per= -DB
>=C2=A0 =C2=A0 in-memory map of *recently written relations*. On any wri= te

Is "per-DB in-memory map" in shared memory?
=
Yes
=C2=A0

>=C2=A0 =C2=A0 (INSERT/UPDATE/DELETE/UPSERT/TRUNCATE), we record the tou= ched relations
>=C2=A0 =C2=A0 with a TTL derived from recent_access_ttl_ms. Entries exp= ire
>=C2=A0 =C2=A0 automatically; writes refresh them.

How do you automatically expire the entries? Are you going to
implement something like a auto sweeper process?

<= /div>
Great question - maybe combine that with a lazy deletion process = on read. similar to what memcached is doing.
=C2=A0

> *Routing + query parsing: *For incoming statements we parse enough to<= br> >=C2=A0 =C2=A0 answer two questions: (1) is it a read or a write? and (2= ) which relations
>=C2=A0 =C2=A0 are referenced? If a read touches any =E2=80=9Crecently w= ritten=E2=80=9D relation, we *force
>=C2=A0 =C2=A0 route to primary*; otherwise we allow normal read load-ba= lancing to
>=C2=A0 =C2=A0 replicas.

Pgpool-II already does (1) and (2).

1 -= of course, i'm trying to build on top of it.
2 - maybe i'= ;m not understanding the existing documentation correctly - but i couldn= 9;t find something that takes the specific relations (tables) under conside= ration, only query type (Read/Write) or passing the delay_threshold.
<= div>Our approach here basically accepts no delay for these specific relatio= ns. so you get guaranteed data freshness at the expense of checking the spe= cific table. it's a different kind of tradeoff.
the whole app= roach can be expanded to take further "generic values" under cons= iderations if needed to also take "tenant" id for instance under = consideration. tho for those cases, using a table per tenant already solves= that.

Please let me know what i might be missing = here.=C2=A0
=C2=A0

> *Notes on behavior & ops:*
>
>
>=C2=A0 =C2=A0 -
>
>=C2=A0 =C2=A0 *Config & hot reload:* Operators (or an external cont= roller) can update
>=C2=A0 =C2=A0 recent_access_ttl_ms dynamically and trigger hot reload t= o adapt to
>=C2=A0 =C2=A0 changing conditions=E2=80=95no reliance on Aurora interna= ls.
>=C2=A0 =C2=A0 -
>
>=C2=A0 =C2=A0 *Safety levers:* a global max TTL, optional allow/deny li= sts, and
>=C2=A0 =C2=A0 metrics (e.g., =E2=80=9Creads forced to primary due to re= cency=E2=80=9D) for visibility.

Please elaborate more on this. Allow/deny what?

We can add "table list" that would ignore the feature, o= r in reverse as an allow list that would enable it only for specific tables= . I don't think that's needed, especially not for V1.=C2=A0


>=C2=A0 =C2=A0 *Defaults & compatibility:* all defaults are safe/off= ; enabling requires
>=C2=A0 =C2=A0 explicit opt-in.

Sounds good.

> I=E2=80=99ll prepare the code changes and send a patch/PR, but before = diving in I
> wanted to check if anyone has *objections, concerns, or preferred
> alternatives*=E2=80=95particularly around parser hooks, shared memory = use, or
> hot-reload mechanics in pgpool-II.

Probably you should consider adding a pcp command to notice pgpool the
"recent_access_ttl_ms". That is far more efficient than reloading=
pgpool.conf.

Great idea! i wasn't aware= of the mechanism to be honest.

lastly another not= e that came up - we can disable the feature and load balancing in case that= we have to evict old items in the map. or have it configurable how to beha= ve in such a scenario.
=C2=A0
=C2=A0

> Thanks for considering,
> --
> Nadav Shatz
> Tailor Brands | CTO

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English:
http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Best regards,
--
<= div dir=3D"ltr">Nadav Shatz
Tailor Brands=C2=A0| CTO
--0000000000005df85f063ca45274--