Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vx5Fp-00EWN3-2f for pgsql-hackers@arkaria.postgresql.org; Mon, 02 Mar 2026 15:32:18 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vx5Eo-001boY-1d for pgsql-hackers@arkaria.postgresql.org; Mon, 02 Mar 2026 15:31:14 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vx5Eo-001bnu-0H for pgsql-hackers@lists.postgresql.org; Mon, 02 Mar 2026 15:31:14 +0000 Received: from mail-ej1-x62c.google.com ([2a00:1450:4864:20::62c]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vx5Em-000000002DB-0E1d for pgsql-hackers@postgresql.org; Mon, 02 Mar 2026 15:31:14 +0000 Received: by mail-ej1-x62c.google.com with SMTP id a640c23a62f3a-b936505e7a0so101405766b.1 for ; Mon, 02 Mar 2026 07:31:11 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1772465471; cv=none; d=google.com; s=arc-20240605; b=Va5oFTuFZmriZXtnmA11Kgaj4jwz96Ti5UG16X0wTX20QaFtraAZ+1SRRziqPxr+9o xjf7HWIWuT48rCysBgbRIY4f/W366jTaxMzJQBcpIngnNS1RyHJVZMVst5Ph3xJvPkRA DTLKD+6ZRUJF2mLso1eDMReltI5+1OUK5WeJGIuBaY5jJpmvXrb0A4Vh73ZSA1NoBKVP +FXZwixpeWVmoY6iYAN49/v3XBSFCVfl47hjimY4pWoXxwqfKJZLYWyHryccv1il0ybo rBAr1PpqMQSutyYzLpoZK7ehnf2ZeofMscM/sJG3xAq/GFZqsQKD+ahwF+VntePrTQRb k/sQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=Ym26uHL7jYyXtnXzfAqSQvA42qYcvtA3eFinVMOVz/c=; fh=EtvjyVH5J6Ma+LVt4YqSCJB7KyCJ7DA6AcJZYF4RMH0=; b=RCzwgR0nBV/G+ZNijpKsG5F8SoFm7LfOW492X/uzN7zPxOFstMLLBxG5XgSfJwh/5d wbHYF0COgLuBlv3jyHEtuWB/mUoLzDzFGFhYIamc3819N0MKG+3CmEMEaYYv+/mkEasc ux+b33QthTcFETP61eZk+jwrzn2AsVS3251bummAdRjW1Xj6n4FkXkY5Lb+Y6OdiW6Mf CxPvQDeYIw9B2PbHHNQnyxLyeNhMWs6uXUggZzA+LCTn2vGqUD2KEC+0PXWmEEVxJu3I gNcFOMxXvkvCeQNiUWZPJZXoEQ7NgxxjpqTJBvRzAx2dkZRSDC6QW8i89KUKxan4cQ/I 96fQ==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1772465471; x=1773070271; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Ym26uHL7jYyXtnXzfAqSQvA42qYcvtA3eFinVMOVz/c=; b=atU5zwr3LoMQQTiJk+8mrVOnI2ysqh8zx3mTOnhdcBwXPoD6KoQ7wsyO5I4p7eJzi5 1JIKGfmETmyfSxal92doawogTWWVRiAOaE5VRJypE1pDGX93bWHhTCnE02hKIAo0hhA3 YZcFJ0u5/lrrPCmWtZA3pKwoOp95j6qWEJ09Firf4mUqOBYbB8lXFb1OPu6JMi7TbgWR He/JX4EyM5GywKYUAFP2e5ftyy0sHiKQk41ULS/eEz8x3n+0IkDYCkeZZMeIblLLLSi1 0GKOPB7z6gw+oTbcpD5x7Z058Jyry/hLDJH6Sf6eqvTUdtoRS4yNwByvn8jdeepEqYXO EMaA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1772465471; x=1773070271; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=Ym26uHL7jYyXtnXzfAqSQvA42qYcvtA3eFinVMOVz/c=; b=jQF4KrBj+b2kzHv4anovTx1lnBVTtyhsVyu/0KjR7JvRYcjB4KCBHLnemQE6kzeAq5 f9bgbTE4gQYU7QqMsMCcrab8I5kNFhuyAbxN0Y43Zit1pRQir3TDO0ULwdFBUi4kUIbC VoO1EMqHqu/xRHey+M/QDk3sXwIwW1lRu1QwNBMNkVF/M9WtzNikZws5YCAp/K7t3seR UrxExDUh8odKGyJ/QdROH6kj/Q99O5ZcA7X2i31DpO/TCEstWYm2Yk73WJPt7mRHLRlt caYyQ7UHgsi7zORhDTnbAqHOtpXc8rRdKNIZbQaVSVyERGyLNzuAM983pzZEbTHcY+0Y GbXg== X-Forwarded-Encrypted: i=1; AJvYcCV49FoqXHsVVcdfq6swHho6lGbJQaYoY/yLxqZPlTJ9ZwkOaeXR04ChN3spabayXqydCSEz+XBZLfFueOz2@postgresql.org X-Gm-Message-State: AOJu0YypHuiSvKdiYVD2J4ZPCgYYuI8am8Vh/gJsFcoGbaJOwjCYsMnL vLCcFwib+HqZvCVQKMg4HxiN0hoFCe+zI2MYoE+Ca/cGUHb5uMAtmebh3UlIBdxl4VHDtGXDZGk BvLYE4++03Ixit1uL+Keb9Lw265vM5jA= X-Gm-Gg: ATEYQzzsYLYrE68wSjuTb1/CPm6htEAgFBWP7Aq83RS4dxNzB8aLmgoARJvS8bimTaA 7YWRHFnAgFukqQIpzZiSLmb8oZwE4syu1uQUJAX7H2MPchKd76JaCftBMQXMQ2dGyTIZJ7K1BCd dFrgefIVDFm8ZrSEMAw9KkVQOU/A7TT0CLpBvKDgwKlyYN/617pcafhc87ht2wTJvMEkga1C5wJ znWpL5sXf63gJPbmn4+plOtMPsQL3XfN33vk3Wmyzyirxfg7P8mdgk7C+RUVTGBtWoLkCaVvnqT Y6LppDpJhGbr6ODzgAc4Zj8dlUnv6t3nGmN9cXZ+J8j39pGfizWCi6hyAfFYB36F1YcyLQ== X-Received: by 2002:a17:906:ef06:b0:b90:b959:4c72 with SMTP id a640c23a62f3a-b9375810b39mr949387566b.6.1772465470493; Mon, 02 Mar 2026 07:31:10 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Junwang Zhao Date: Mon, 2 Mar 2026 23:30:58 +0800 X-Gm-Features: AaiRm50yrJSjkOopnPOmlsWYXZ2WGPfN4ONwm6EIHP0-KrS89PVg2TOzSDFIbdg Message-ID: Subject: Re: Eliminating SPI / SQL from some RI triggers - take 3 To: Amit Langote Cc: Pavel Stehule , PostgreSQL-development Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Sat, Feb 28, 2026 at 3:08=E2=80=AFPM Amit Langote wrote: > > Hi Junwang, > > On Mon, Feb 23, 2026 at 10:45=E2=80=AFPM Junwang Zhao = wrote: > > On Thu, Feb 19, 2026 at 5:21=E2=80=AFPM Amit Langote wrote: > > > I re-ran the benchmarks (same test as yours, different machine): > > > > > > create table pk (a numeric primary key); > > > create table fk (a bigint references pk); > > > insert into pk select generate_series(1, 2000000); > > > insert into fk select generate_series(1, 2000000, 2); > > > > > > master: 2444 ms (median of 3 runs) > > > 0001: 1382 ms (43% faster) > > > 0001+0002: 1202 ms (51% faster, 13% over 0001 alone) > > > > I can get similar improvement on my old mac intel chip: > > > > master: 12963.993 ms > > 0001: 6641.692 ms, 48.8% faster > > 0001+0002: 5771.703 ms, 55.5% faster > > > > > > Also, with int PK / int FK (1M rows): > > > > > > create table pk (a int primary key); > > > create table fk (a int references pk); > > > insert into pk select generate_series(1, 1000000); > > > insert into fk select generate_series(1, 1000000); > > > > > > master: 1000 ms > > > 0001: 520 ms (48% faster) > > > 0001+0002: 432 ms (57% faster, 17% over 0001 alone) > > > > master: 11134.583 ms > > 0001: 5240.298 ms, 52.9% faster > > 0001+0002: 4554.215 ms, 59.1% faster > > Thanks for testing, good to see similar numbers. I had forgotten to > note that these results are when these PK index probes don't do any > I/O, though you might be aware of that. Below, I report some numbers > that Tomas Vondra shared with me off-list where the probes do have to > perform I/O and there the benefits from only this patch set are only > marginal. > > > I don't have any additional comments on the patch except one minor nit, > > maybe merge the following two if conditions into one, not a strong opin= ion > > though. > > > > if (use_cache) > > { > > /* > > * The snapshot was registered once when the cache entry was created. > > * We just patch curcid to reflect the new command counter. > > * SnapshotSetCommandId() only patches process-global statics, not > > * registered copies, so we do it directly. > > * > > * The xmin/xmax/xip fields don't need refreshing: within a single > > * statement batch, only curcid changes between rows. > > */ > > Assert(fpentry && fpentry->snapshot !=3D NULL); > > snapshot =3D fpentry->snapshot; > > snapshot->curcid =3D GetCurrentCommandId(false); > > } > > else > > snapshot =3D RegisterSnapshot(GetLatestSnapshot()); > > > > if (use_cache) > > { > > pk_rel =3D fpentry->pk_rel; > > idx_rel =3D fpentry->idx_rel; > > scandesc =3D fpentry->scandesc; > > slot =3D fpentry->slot; > > } > > else > > { > > pk_rel =3D table_open(riinfo->pk_relid, RowShareLock); > > idx_rel =3D index_open(riinfo->conindid, AccessShareLock); > > scandesc =3D index_beginscan(pk_rel, idx_rel, > > snapshot, NULL, > > riinfo->nkeys, 0); > > slot =3D table_slot_create(pk_rel, NULL); > > } > > Good idea, done. > > While polishing 0002, I revisited the snapshot caching semantics. The > previous commit message hand-waved about only curcid changing between > rows, but GetLatestSnapshot() also reflects other backends' commits, > so reusing the snapshot is a deliberate semantic change from the SPI > path. I think it's safe because curcid is all we need for > intra-statement visibility, concurrent commits either already happened > before our snapshot (and are visible) or are racing with our statement > and wouldn't be seen reliably even with per-row snapshots since the > order in which FK rows are checked is nondeterministic, and > LockTupleKeyShare prevents the PK row from disappearing regardless. In > essence, we're treating all the FK checks within a trigger-firing > cycle as a single plan execution that happens to scan N rows, rather > than N independent SPI queries each taking a fresh snapshot. That's > the natural model -- a normal SELECT ... FOR KEY SHARE plan doesn't > re-take GetLatestSnapshot() between rows either. > > Similarly, the permission check (schema USAGE + table SELECT) is now > done once at cache entry creation in ri_FastPathGetEntry() rather than > on every flush. nice improvement. > The RI check runs as the PK table owner, so we're > verifying that the owner can access their own table -- a condition > that won't change unless someone explicitly revokes from the owner, > which would also break the SPI path. > > > > David Rowley mentioned off-list that it might be worth batching > > > multiple FK values into a single index probe, leveraging the > > > ScalarArrayOp btree improvements from PostgreSQL 17. The idea would b= e > > > to buffer FK values across trigger invocations in the per-constraint > > > cache (0002 already has the right structure for this), build a > > > SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf > > > pages in one sorted traversal instead of one tree descent per row. Th= e > > > locking and recheck would still be per-tuple, but the index traversal > > > cost drops significantly. Single-column FKs are the obvious starting > > > point. That seems worth exploring but can be done as a separate patch > > > on top of this. > > > > I will take a look at this in the following weeks. > > I ended up going ahead with the batching and SAOP idea that David > mentioned -- I had a proof-of-concept working shortly after posting v3 > and kept iterating on it. So attached set is now: > > 0001 - Core fast path (your 0001+0002 reworked, as before) > > 0002 - Per-batch resource caching (PK relation, index, scandesc, snapshot= ) > > 0003 - FK row buffering: materialize FK tuples into a per-constraint > batch buffer (64 rows), flush when full or at batch end > > 0004 - SK_SEARCHARRAY for single-column FKs: build an array from the > buffered FK values and do one index scan instead of 64 separate tree > descents. Multi-column FKs fall back to a per-row loop. > > 0003 is pure infrastructure -- it doesn't improve performance on its > own because the per-row index descent still dominates. The payoff > comes in 0004. > > Numbers (same machine as before, median of 3 runs): > > numeric PK / bigint FK, 1M rows: > master: 2487 ms > 0001..0004: 1168 ms (2.1x) > > int PK / int FK, 500K rows: > master: 1043 ms > 0001..0004: 335 ms (3.1x) > > The int/int case benefits most because the per-row cost is lower, so > the SAOP traversal savings are a larger fraction of the total. The > numeric/bigint case still sees a solid improvement despite the > cross-type cast overhead. > > Tomas Vondra also tested with an I/O-intensive workload (dataset > larger than shared_buffers, combined with his and Peter Geoghegan's > I/O prefetching patches) and confirmed that the batching + SAOP > approach helps there too, not just in the CPU-bound / memory-resident > case. In fact he showed that the patches here don't make a big dent > when the main bottleneck is I/O as shown in numbers that he shared in > an off-list email: > > master: 161617 ms > ri-check (0001..0004): 149446 ms (1.08x) > ri-check + i/o prefetching: 50885 ms (3.2x) > > So the RI patches alone only give ~8% here since most time is waiting > on reads. But the batching gives the prefetch machinery a window of > upcoming probes to issue readahead against, so the two together yield > 3.2x. impressive! > > Tomas also caught a memory context bug in the batch flush path: the > cached scandesc lives in TopTransactionContext, but the btree AM > defers _bt_preprocess_keys allocation to the first getnext call, which > pallocs into CurrentMemoryContext. If that's a short-lived > per-trigger-row context, the scandesc has dangling pointers on the > next rescan. Fixed by switching to TopTransactionContext before the > probe loop. > > Finally, I've fixed a number of other small and not-so-small bugs > found while polishing the old patches and made other stylistic > improvements. One notable change is that I introduced a FastPathMeta Yeah, this is much better than the fpmeta_valid field. > struct to store the fast path metadata instead of dumping those arrays > in the RI_ConstraintInfo. It's allocated lazily on first use and holds > the per-key compare entries, operator procedures, and index strategy > info needed by the scan key construction, so RI_ConstraintInfo doesn't > pay for them when the fast path isn't used. > > > On Mon, Feb 23, 2026 at 10:45=E2=80=AFPM Junwang Zhao = wrote: > > > > Hi Amit, > > > > On Thu, Feb 19, 2026 at 5:21=E2=80=AFPM Amit Langote wrote: > > > > > > Hi Junwang, > > > > > > On Mon, Dec 1, 2025 at 3:09=E2=80=AFPM Junwang Zhao wrote: > > > > As Amit has already stated, we are approaching a hybrid "fast-path = + fallback" > > > > design. > > > > > > > > 0001 adds a fast path optimization for foreign key constraint check= s > > > > that bypasses the SPI executor, the fast path applies when the refe= renced > > > > table is not partitioned, and the constraint does not involve tempo= ral > > > > semantics. > > > > > > > > With the following test: > > > > > > > > create table pk (a numeric primary key); > > > > create table fk (a bigint references pk); > > > > insert into pk select generate_series(1, 2000000); > > > > > > > > head: > > > > > > > > [local] zhjwpku@postgres:5432-90419=3D# insert into fk select > > > > generate_series(1, 2000000, 2); > > > > INSERT 0 1000000 > > > > Time: 13516.177 ms (00:13.516) > > > > > > > > [local] zhjwpku@postgres:5432-90419=3D# update fk set a =3D a + 1; > > > > UPDATE 1000000 > > > > Time: 15057.638 ms (00:15.058) > > > > > > > > patched: > > > > > > > > [local] zhjwpku@postgres:5432-98673=3D# insert into fk select > > > > generate_series(1, 2000000, 2); > > > > INSERT 0 1000000 > > > > Time: 8248.777 ms (00:08.249) > > > > > > > > [local] zhjwpku@postgres:5432-98673=3D# update fk set a =3D a + 1; > > > > UPDATE 1000000 > > > > Time: 10117.002 ms (00:10.117) > > > > > > > > 0002 cache fast-path metadata used by the index probe, at the curre= nt > > > > time only comparison operator hash entries, operator function OIDs > > > > and strategy numbers and subtypes for index scans. But this cache > > > > doesn't buy any performance improvement. > > > > > > > > Caching additional metadata should improve performance for foreign = key checks. > > > > > > > > Amit suggested introducing a mechanism for ri_triggers.c to registe= r a > > > > cleanup callback in the EState, which AfterTriggerEndQuery() could = then > > > > invoke to release per-statement cached metadata (such as the IndexS= canDesc). > > > > However, I haven't been able to implement this mechanism yet. > > > > > > Thanks for working on this. I've taken your patches as a starting > > > point and reworked the series into two patches (attached): 1st is you= r > > > 0001+0002 as the core patch that adds a gated fast-path alternative t= o > > > SPI and 2nd where I added per-statement resource caching. Doing the > > > latter turned out to be not so hard thanks to the structure you chose > > > to build the core fast path. Good call on adding the RLS and ACL tes= t > > > cases, btw. > > > > > > So, 0001 is a functionally complete fast path: concurrency handling, > > > REPEATABLE READ crosscheck, cross-type operators, security context, > > > and metadata caching. 0002 implements the per-statement resource > > > caching we discussed, though instead of sharing the EState between > > > trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback > > > mechanism that fires at the end of each trigger-firing cycle > > > (per-statement for immediate constraints, or until COMMIT for deferre= d > > > ones). It layers resource caching on top so that the PK relation, > > > index, scan descriptor, and snapshot stay open across all FK trigger > > > invocations within a single trigger-firing cycle rather than being > > > opened and closed per row. > > > > > > Note that phe previous 0002 (metadata caching) is folded into 0001, > > > and most of the new fast-path logic added in 0001 now lives in > > > ri_FastPathCheck() rather than inline in RI_FKey_check(), so the > > > RI_FKey_check diff is just the gating call and SPI fallback. > > > > > > I re-ran the benchmarks (same test as yours, different machine): > > > > > > create table pk (a numeric primary key); > > > create table fk (a bigint references pk); > > > insert into pk select generate_series(1, 2000000); > > > insert into fk select generate_series(1, 2000000, 2); > > > > > > master: 2444 ms (median of 3 runs) > > > 0001: 1382 ms (43% faster) > > > 0001+0002: 1202 ms (51% faster, 13% over 0001 alone) > > > > I can get similar improvement on my old mac intel chip: > > > > master: 12963.993 ms > > 0001: 6641.692 ms, 48.8% faster > > 0001+0002: 5771.703 ms, 55.5% faster > > > > > > > > Also, with int PK / int FK (1M rows): > > > > > > create table pk (a int primary key); > > > create table fk (a int references pk); > > > insert into pk select generate_series(1, 1000000); > > > insert into fk select generate_series(1, 1000000); > > > > > > master: 1000 ms > > > 0001: 520 ms (48% faster) > > > 0001+0002: 432 ms (57% faster, 17% over 0001 alone) > > > > master: 11134.583 ms > > 0001: 5240.298 ms, 52.9% faster > > 0001+0002: 4554.215 ms, 59.1% faster > > > > > > > > The incremental gain from 0002 comes from eliminating per-row relatio= n > > > open/close, scan begin/end, slot alloc/free, and replacing per-row > > > GetSnapshotData() with only curcid adjustment on the registered > > > snapshot copy in the cache. > > > > > > The two current limitations are partitioned referenced tables and > > > temporal foreign keys. Partitioned PKs are relatively uncommon in > > > practice, so the non-partitioned case should cover most FK workloads, > > > so I'm not sure it's worth the added complexity to support them. > > > Temporal FKs are inherently multi-row, so they're a poor fit for a > > > single-probe fast path. > > > > > > David Rowley mentioned off-list that it might be worth batching > > > multiple FK values into a single index probe, leveraging the > > > ScalarArrayOp btree improvements from PostgreSQL 17. The idea would b= e > > > to buffer FK values across trigger invocations in the per-constraint > > > cache (0002 already has the right structure for this), build a > > > SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf > > > pages in one sorted traversal instead of one tree descent per row. Th= e > > > locking and recheck would still be per-tuple, but the index traversal > > > cost drops significantly. Single-column FKs are the obvious starting > > > point. That seems worth exploring but can be done as a separate patch > > > on top of this. > > > > I will take a look at this in the following weeks. > > > > > > > > I think the series is in reasonable shape but would appreciate extra > > > eyeballs, especially on the concurrency handling in ri_LockPKTuple() > > > in 0001 and the snapshot lifecycle in 0002. Or anything else that > > > catches one's eye. > > > > > > -- > > > Thanks, Amit Langote > > > > I don't have any additional comments on the patch except one minor nit, > > maybe merge the following two if conditions into one, not a strong opin= ion > > though. > > > > if (use_cache) > > { > > /* > > * The snapshot was registered once when the cache entry was created. > > * We just patch curcid to reflect the new command counter. > > * SnapshotSetCommandId() only patches process-global statics, not > > * registered copies, so we do it directly. > > * > > * The xmin/xmax/xip fields don't need refreshing: within a single > > * statement batch, only curcid changes between rows. > > */ > > Assert(fpentry && fpentry->snapshot !=3D NULL); > > snapshot =3D fpentry->snapshot; > > snapshot->curcid =3D GetCurrentCommandId(false); > > } > > else > > snapshot =3D RegisterSnapshot(GetLatestSnapshot()); > > > > if (use_cache) > > { > > pk_rel =3D fpentry->pk_rel; > > idx_rel =3D fpentry->idx_rel; > > scandesc =3D fpentry->scandesc; > > slot =3D fpentry->slot; > > } > > else > > { > > pk_rel =3D table_open(riinfo->pk_relid, RowShareLock); > > idx_rel =3D index_open(riinfo->conindid, AccessShareLock); > > scandesc =3D index_beginscan(pk_rel, idx_rel, > > snapshot, NULL, > > riinfo->nkeys, 0); > > slot =3D table_slot_create(pk_rel, NULL); > > } > > > > -- > > Regards > > Junwang Zhao > > > > -- > Thanks, Amit Langote --=20 Regards Junwang Zhao