MIME-Version: 1.0
References: 
 <CA+HiwqF4C0ws3cO+z5cLkPuvwnAwkSp7sfvgGj3yQ=Li6KNMqA@mail.gmail.com>
 <CA+HiwqGM6nvAV5O+=Nr+BXMPWOma0oeCRVzVP0XiLE8zX5TVAg@mail.gmail.com>
 <CA+HiwqGMaovCUgDbGxVGnK0Mrivr+ph3YE2Ws+47-ugyPb4f7g@mail.gmail.com>
 <CAFj8pRDaiBe_GOLk_yyYHTtPiDAAaLOM8u1-=Q3ZgXBTH+1igg@mail.gmail.com>
 <CA+HiwqGA5Ay_MR0eJEEbt4j6WrVh4F+AasTp8yCbs5aJLOJn6Q@mail.gmail.com>
 <CAEG8a3JM=NoqiTK0V6S9FNxZPvy1+C5F7rfafTtPKBVWnunL-g@mail.gmail.com>
 <CA+HiwqEyiLCY6MTLbOJXDdLNNQLaURYHvdW797MQgbjEK9od4Q@mail.gmail.com>
 <CAEG8a3+VBpwPf1Rm-ECD90whM9b3YnGhux5CVXdsL6khiBfzRQ@mail.gmail.com>
 <CA+HiwqF2UHzF0sKCp-F2a-U29rqh_9ZPy=f1h+Fh_=M8efj3pg@mail.gmail.com>
In-Reply-To: 
 <CA+HiwqF2UHzF0sKCp-F2a-U29rqh_9ZPy=f1h+Fh_=M8efj3pg@mail.gmail.com>
From: Junwang Zhao <zhjwpku@gmail.com>
Date: Mon, 2 Mar 2026 23:30:58 +0800
Message-ID: 
 <CAEG8a3L9Ew-WL8sxLROVOcypeaENPmd8qCmMvz4geoGL1TDGCA@mail.gmail.com>
Subject: Re: Eliminating SPI / SQL from some RI triggers - take 3
To: Amit Langote <amitlangote09@gmail.com>
Cc: Pavel Stehule <pavel.stehule@gmail.com>,
	PostgreSQL-development <pgsql-hackers@postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAEG8a3L9Ew-WL8sxLROVOcypeaENPmd8qCmMvz4geoGL1TDGCA%40mail.gmail.com>
Precedence: bulk

On Sat, Feb 28, 2026 at 3:08=E2=80=AFPM Amit Langote <amitlangote09@gmail.c=
om> wrote:
>
> Hi Junwang,
>
> On Mon, Feb 23, 2026 at 10:45=E2=80=AFPM Junwang Zhao <zhjwpku@gmail.com>=
 wrote:
> > On Thu, Feb 19, 2026 at 5:21=E2=80=AFPM Amit Langote <amitlangote09@gma=
il.com> wrote:
> > > I re-ran the benchmarks (same test as yours, different machine):
> > >
> > > create table pk (a numeric primary key);
> > > create table fk (a bigint references pk);
> > > insert into pk select generate_series(1, 2000000);
> > > insert into fk select generate_series(1, 2000000, 2);
> > >
> > > master: 2444 ms  (median of 3 runs)
> > > 0001: 1382 ms  (43% faster)
> > > 0001+0002: 1202 ms  (51% faster, 13% over 0001 alone)
> >
> > I can get similar improvement on my old mac intel chip:
> >
> > master: 12963.993 ms
> > 0001: 6641.692 ms, 48.8% faster
> > 0001+0002: 5771.703 ms, 55.5% faster
> > >
> > > Also, with int PK / int FK (1M rows):
> > >
> > > create table pk (a int primary key);
> > > create table fk (a int references pk);
> > > insert into pk select generate_series(1, 1000000);
> > > insert into fk select generate_series(1, 1000000);
> > >
> > > master: 1000 ms
> > > 0001: 520 ms  (48% faster)
> > > 0001+0002: 432 ms  (57% faster, 17% over 0001 alone)
> >
> > master: 11134.583 ms
> > 0001: 5240.298 ms, 52.9% faster
> > 0001+0002: 4554.215 ms, 59.1% faster
>
> Thanks for testing, good to see similar numbers.  I had forgotten to
> note that these results are when these PK index probes don't do any
> I/O, though you might be aware of that. Below, I report some numbers
> that Tomas Vondra shared with me off-list where the probes do have to
> perform I/O and there the benefits from only this patch set are only
> marginal.
>
> > I don't have any additional comments on the patch except one minor nit,
> > maybe merge the following two if conditions into one, not a strong opin=
ion
> > though.
> >
> > if (use_cache)
> > {
> > /*
> > * The snapshot was registered once when the cache entry was created.
> > * We just patch curcid to reflect the new command counter.
> > * SnapshotSetCommandId() only patches process-global statics, not
> > * registered copies, so we do it directly.
> > *
> > * The xmin/xmax/xip fields don't need refreshing: within a single
> > * statement batch, only curcid changes between rows.
> > */
> > Assert(fpentry && fpentry->snapshot !=3D NULL);
> > snapshot =3D fpentry->snapshot;
> > snapshot->curcid =3D GetCurrentCommandId(false);
> > }
> > else
> > snapshot =3D RegisterSnapshot(GetLatestSnapshot());
> >
> > if (use_cache)
> > {
> > pk_rel =3D fpentry->pk_rel;
> > idx_rel =3D fpentry->idx_rel;
> > scandesc =3D fpentry->scandesc;
> > slot =3D fpentry->slot;
> > }
> > else
> > {
> > pk_rel =3D table_open(riinfo->pk_relid, RowShareLock);
> > idx_rel =3D index_open(riinfo->conindid, AccessShareLock);
> > scandesc =3D index_beginscan(pk_rel, idx_rel,
> > snapshot, NULL,
> > riinfo->nkeys, 0);
> > slot =3D table_slot_create(pk_rel, NULL);
> > }
>
> Good idea, done.
>
> While polishing 0002, I revisited the snapshot caching semantics. The
> previous commit message hand-waved about only curcid changing between
> rows, but GetLatestSnapshot() also reflects other backends' commits,
> so reusing the snapshot is a deliberate semantic change from the SPI
> path. I think it's safe because curcid is all we need for
> intra-statement visibility, concurrent commits either already happened
> before our snapshot (and are visible) or are racing with our statement
> and wouldn't be seen reliably even with per-row snapshots since the
> order in which FK rows are checked is nondeterministic, and
> LockTupleKeyShare prevents the PK row from disappearing regardless. In
> essence, we're treating all the FK checks within a trigger-firing
> cycle as a single plan execution that happens to scan N rows, rather
> than N independent SPI queries each taking a fresh snapshot. That's
> the natural model -- a normal SELECT ... FOR KEY SHARE plan doesn't
> re-take GetLatestSnapshot() between rows either.
>
> Similarly, the permission check (schema USAGE + table SELECT) is now
> done once at cache entry creation in ri_FastPathGetEntry() rather than
> on every flush.

nice improvement.

> The RI check runs as the PK table owner, so we're
> verifying that the owner can access their own table -- a condition
> that won't change unless someone explicitly revokes from the owner,
> which would also break the SPI path.
>
> > > David Rowley mentioned off-list that it might be worth batching
> > > multiple FK values into a single index probe, leveraging the
> > > ScalarArrayOp btree improvements from PostgreSQL 17. The idea would b=
e
> > > to buffer FK values across trigger invocations in the per-constraint
> > > cache (0002 already has the right structure for this), build a
> > > SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
> > > pages in one sorted traversal instead of one tree descent per row. Th=
e
> > > locking and recheck would still be per-tuple, but the index traversal
> > > cost drops significantly. Single-column FKs are the obvious starting
> > > point. That seems worth exploring but can be done as a separate patch
> > > on top of this.
> >
> > I will take a look at this in the following weeks.
>
> I ended up going ahead with the batching and SAOP idea that David
> mentioned -- I had a proof-of-concept working shortly after posting v3
> and kept iterating on it.  So attached set is now:
>
> 0001 - Core fast path (your 0001+0002 reworked, as before)
>
> 0002 - Per-batch resource caching (PK relation, index, scandesc, snapshot=
)
>
> 0003 - FK row buffering: materialize FK tuples into a per-constraint
> batch buffer (64 rows), flush when full or at batch end
>
> 0004 - SK_SEARCHARRAY for single-column FKs: build an array from the
> buffered FK values and do one index scan instead of 64 separate tree
> descents.  Multi-column FKs fall back to a per-row loop.
>
> 0003 is pure infrastructure -- it doesn't improve performance on its
> own because the per-row index descent still dominates.  The payoff
> comes in 0004.
>
> Numbers (same machine as before, median of 3 runs):
>
> numeric PK / bigint FK, 1M rows:
> master: 2487 ms
> 0001..0004: 1168 ms  (2.1x)
>
> int PK / int FK, 500K rows:
> master: 1043 ms
> 0001..0004: 335 ms  (3.1x)
>
> The int/int case benefits most because the per-row cost is lower, so
> the SAOP traversal savings are a larger fraction of the total. The
> numeric/bigint case still sees a solid improvement despite the
> cross-type cast overhead.
>
> Tomas Vondra also tested with an I/O-intensive workload (dataset
> larger than shared_buffers, combined with his and Peter Geoghegan's
> I/O prefetching patches) and confirmed that the batching + SAOP
> approach helps there too, not just in the CPU-bound / memory-resident
> case.  In fact he showed that the patches here don't make a big dent
> when the main bottleneck is I/O as shown in numbers that he shared in
> an off-list email:
>
> master: 161617 ms
> ri-check (0001..0004): 149446 ms  (1.08x)
> ri-check + i/o prefetching: 50885 ms  (3.2x)
>
> So the RI patches alone only give ~8% here since most time is waiting
> on reads.  But the batching gives the prefetch machinery a window of
> upcoming probes to issue readahead against, so the two together yield
> 3.2x.

impressive!

>
> Tomas also caught a memory context bug in the batch flush path: the
> cached scandesc lives in TopTransactionContext, but the btree AM
> defers _bt_preprocess_keys allocation to the first getnext call, which
> pallocs into CurrentMemoryContext. If that's a short-lived
> per-trigger-row context, the scandesc has dangling pointers on the
> next rescan. Fixed by switching to TopTransactionContext before the
> probe loop.
>
> Finally, I've fixed a number of other small and not-so-small bugs
> found while polishing the old patches and made other stylistic
> improvements. One notable change is that I introduced a FastPathMeta

Yeah, this is much better than the fpmeta_valid field.

> struct to store the fast path metadata instead of dumping those arrays
> in the RI_ConstraintInfo. It's allocated lazily on first use and holds
> the per-key compare entries, operator procedures, and index strategy
> info needed by the scan key construction, so RI_ConstraintInfo doesn't
> pay for them when the fast path isn't used.
>
>
> On Mon, Feb 23, 2026 at 10:45=E2=80=AFPM Junwang Zhao <zhjwpku@gmail.com>=
 wrote:
> >
> > Hi Amit,
> >
> > On Thu, Feb 19, 2026 at 5:21=E2=80=AFPM Amit Langote <amitlangote09@gma=
il.com> wrote:
> > >
> > > Hi Junwang,
> > >
> > > On Mon, Dec 1, 2025 at 3:09=E2=80=AFPM Junwang Zhao <zhjwpku@gmail.co=
m> wrote:
> > > > As Amit has already stated, we are approaching a hybrid "fast-path =
+ fallback"
> > > > design.
> > > >
> > > > 0001 adds a fast path optimization for foreign key constraint check=
s
> > > > that bypasses the SPI executor, the fast path applies when the refe=
renced
> > > > table is not partitioned, and the constraint does not involve tempo=
ral
> > > > semantics.
> > > >
> > > > With the following test:
> > > >
> > > > create table pk (a numeric primary key);
> > > > create table fk (a bigint references pk);
> > > > insert into pk select generate_series(1, 2000000);
> > > >
> > > > head:
> > > >
> > > > [local] zhjwpku@postgres:5432-90419=3D# insert into fk select
> > > > generate_series(1, 2000000, 2);
> > > > INSERT 0 1000000
> > > > Time: 13516.177 ms (00:13.516)
> > > >
> > > > [local] zhjwpku@postgres:5432-90419=3D# update fk set a =3D a + 1;
> > > > UPDATE 1000000
> > > > Time: 15057.638 ms (00:15.058)
> > > >
> > > > patched:
> > > >
> > > > [local] zhjwpku@postgres:5432-98673=3D# insert into fk select
> > > > generate_series(1, 2000000, 2);
> > > > INSERT 0 1000000
> > > > Time: 8248.777 ms (00:08.249)
> > > >
> > > > [local] zhjwpku@postgres:5432-98673=3D# update fk set a =3D a + 1;
> > > > UPDATE 1000000
> > > > Time: 10117.002 ms (00:10.117)
> > > >
> > > > 0002 cache fast-path metadata used by the index probe, at the curre=
nt
> > > > time only comparison operator hash entries, operator function OIDs
> > > > and strategy numbers and subtypes for index scans. But this cache
> > > > doesn't buy any performance improvement.
> > > >
> > > > Caching additional metadata should improve performance for foreign =
key checks.
> > > >
> > > > Amit suggested introducing a mechanism for ri_triggers.c to registe=
r a
> > > > cleanup callback in the EState, which AfterTriggerEndQuery() could =
then
> > > > invoke to release per-statement cached metadata (such as the IndexS=
canDesc).
> > > > However, I haven't been able to implement this mechanism yet.
> > >
> > > Thanks for working on this.  I've taken your patches as a starting
> > > point and reworked the series into two patches (attached): 1st is you=
r
> > > 0001+0002 as the core patch that adds a gated fast-path alternative t=
o
> > > SPI and 2nd where I added per-statement resource caching.  Doing the
> > > latter turned out to be not so hard thanks to the structure you chose
> > > to build the core fast path.  Good call on adding the RLS and ACL tes=
t
> > > cases, btw.
> > >
> > > So, 0001 is a functionally complete fast path: concurrency handling,
> > > REPEATABLE READ crosscheck, cross-type operators, security context,
> > > and metadata caching. 0002 implements the per-statement resource
> > > caching we discussed, though instead of sharing the EState between
> > > trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
> > > mechanism that fires at the end of each trigger-firing cycle
> > > (per-statement for immediate constraints, or until COMMIT for deferre=
d
> > > ones). It layers resource caching on top so that the PK relation,
> > > index, scan descriptor, and snapshot stay open across all FK trigger
> > > invocations within a single trigger-firing cycle rather than being
> > > opened and closed per row.
> > >
> > > Note that phe previous 0002 (metadata caching) is folded into 0001,
> > > and most of the new fast-path logic added in 0001 now lives in
> > > ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
> > > RI_FKey_check diff is just the gating call and SPI fallback.
> > >
> > > I re-ran the benchmarks (same test as yours, different machine):
> > >
> > > create table pk (a numeric primary key);
> > > create table fk (a bigint references pk);
> > > insert into pk select generate_series(1, 2000000);
> > > insert into fk select generate_series(1, 2000000, 2);
> > >
> > > master: 2444 ms  (median of 3 runs)
> > > 0001: 1382 ms  (43% faster)
> > > 0001+0002: 1202 ms  (51% faster, 13% over 0001 alone)
> >
> > I can get similar improvement on my old mac intel chip:
> >
> > master: 12963.993 ms
> > 0001: 6641.692 ms, 48.8% faster
> > 0001+0002: 5771.703 ms, 55.5% faster
> >
> > >
> > > Also, with int PK / int FK (1M rows):
> > >
> > > create table pk (a int primary key);
> > > create table fk (a int references pk);
> > > insert into pk select generate_series(1, 1000000);
> > > insert into fk select generate_series(1, 1000000);
> > >
> > > master: 1000 ms
> > > 0001: 520 ms  (48% faster)
> > > 0001+0002: 432 ms  (57% faster, 17% over 0001 alone)
> >
> > master: 11134.583 ms
> > 0001: 5240.298 ms, 52.9% faster
> > 0001+0002: 4554.215 ms, 59.1% faster
> >
> > >
> > > The incremental gain from 0002 comes from eliminating per-row relatio=
n
> > > open/close, scan begin/end, slot alloc/free, and replacing per-row
> > > GetSnapshotData() with only curcid adjustment on the registered
> > > snapshot copy in the cache.
> > >
> > > The two current limitations are partitioned referenced tables and
> > > temporal foreign keys. Partitioned PKs are relatively uncommon in
> > > practice, so the non-partitioned case should cover most FK workloads,
> > > so I'm not sure it's worth the added complexity to support them.
> > > Temporal FKs are inherently multi-row, so they're a poor fit for a
> > > single-probe fast path.
> > >
> > > David Rowley mentioned off-list that it might be worth batching
> > > multiple FK values into a single index probe, leveraging the
> > > ScalarArrayOp btree improvements from PostgreSQL 17. The idea would b=
e
> > > to buffer FK values across trigger invocations in the per-constraint
> > > cache (0002 already has the right structure for this), build a
> > > SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
> > > pages in one sorted traversal instead of one tree descent per row. Th=
e
> > > locking and recheck would still be per-tuple, but the index traversal
> > > cost drops significantly. Single-column FKs are the obvious starting
> > > point. That seems worth exploring but can be done as a separate patch
> > > on top of this.
> >
> > I will take a look at this in the following weeks.
> >
> > >
> > > I think the series is in reasonable shape but would appreciate extra
> > > eyeballs, especially on the concurrency handling in ri_LockPKTuple()
> > > in 0001 and the snapshot lifecycle in 0002. Or anything else that
> > > catches one's eye.
> > >
> > > --
> > > Thanks, Amit Langote
> >
> > I don't have any additional comments on the patch except one minor nit,
> > maybe merge the following two if conditions into one, not a strong opin=
ion
> > though.
> >
> > if (use_cache)
> > {
> > /*
> > * The snapshot was registered once when the cache entry was created.
> > * We just patch curcid to reflect the new command counter.
> > * SnapshotSetCommandId() only patches process-global statics, not
> > * registered copies, so we do it directly.
> > *
> > * The xmin/xmax/xip fields don't need refreshing: within a single
> > * statement batch, only curcid changes between rows.
> > */
> > Assert(fpentry && fpentry->snapshot !=3D NULL);
> > snapshot =3D fpentry->snapshot;
> > snapshot->curcid =3D GetCurrentCommandId(false);
> > }
> > else
> > snapshot =3D RegisterSnapshot(GetLatestSnapshot());
> >
> > if (use_cache)
> > {
> > pk_rel =3D fpentry->pk_rel;
> > idx_rel =3D fpentry->idx_rel;
> > scandesc =3D fpentry->scandesc;
> > slot =3D fpentry->slot;
> > }
> > else
> > {
> > pk_rel =3D table_open(riinfo->pk_relid, RowShareLock);
> > idx_rel =3D index_open(riinfo->conindid, AccessShareLock);
> > scandesc =3D index_beginscan(pk_rel, idx_rel,
> > snapshot, NULL,
> > riinfo->nkeys, 0);
> > slot =3D table_slot_create(pk_rel, NULL);
> > }
> >
> > --
> > Regards
> > Junwang Zhao
>
>
>
> --
> Thanks, Amit Langote


--=20
Regards
Junwang Zhao