From: Melanie Plageman <[email protected]>
To: David Rowley <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Kirill Reshke <[email protected]>
Cc: Chao Li <[email protected]>
Cc: Andrey Borodin <[email protected]>
Cc: Xuneng Zhou <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Heikki Linnakangas <[email protected]>
Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Date: Sun, 29 Mar 2026 13:16:39 -0400
Message-ID: <CAAKRu_atv6zA274m8Ysgbfn49c0NbdvHT7nXvd9kroZKnFq8Dg@mail.gmail.com>
In-Reply-To: <CAAKRu_Yt76_HdfR6DtK_wtkSNSj9=VxSV_npt+6T2R=zTzp1Pg@mail.gmail.com>
References: <CAAKRu_Z8Ry_ynNBPAzs_Ry3MQi9NaBgt1ccLgwRsDbxWpocaBg@mail.gmail.com>
<CAAKRu_ZbOp52rnkjf63h5mf94raEKBH7AAbz6QTx-xdH9yLfmQ@mail.gmail.com>
<CAAKRu_b8m+iuupm4ZX+2_V5Xj5u4jCTrU=Tv=epg6p4H2SMkFQ@mail.gmail.com>
<CALdSSPh9hVXNiPwdntWqbMzu5upKy0jBDDe7Un0_Nf2A54R2VQ@mail.gmail.com>
<CAAKRu_a6Cd7JnxhY4A=b_Paxc3UDUDOPeqV3GbzMh=R2KkD-uQ@mail.gmail.com>
<jlsvov4o5xswxjvjhvuchz6y55ncvoc457grvxct7cub5gcxuj@4e2toesujnpr>
<CAAKRu_bsHbvt+VqbjHXsdphKf8fqwBEutRhH3fmo+qUVe=yBKw@mail.gmail.com>
<CAAKRu_ZhHtEaucO--SdYrCjq0zgqk_LPztUD+-QS74A2htXgKw@mail.gmail.com>
<CAAKRu_Zj8G4T=HN3QSY7iQvkKSQk-k1fq+eJkjCBNqoSg63z+Q@mail.gmail.com>
<CAAKRu_bgP-DMZs=D2j2N0+U9+uWU5cVagw-yZLOuhYbWj_KwnA@mail.gmail.com>
<itvgqc6vncbjsjfmrptfvkkeg5vqzhalaguya2z77t6c6ctpc3@wsdrgbn4bxaa>
<CAAKRu_aWMyGB=zg5W7+RUtor6TqsiOwHXSL7Dg4TUUiTSzzcpw@mail.gmail.com>
<[email protected]>
<CAAKRu_Ypa7-JGVR+fstDxU5Cfitk_rf5ijdaqwtoPkztursufA@mail.gmail.com>
<CAAKRu_ZrDadxmGepBwPZ03yAKnMxwsHYn8SK9Gg7VqigLLVUWg@mail.gmail.com>
<CAApHDvqAOeOwCKh9g0gfxWa040=Hyc7_oA=C59rjod8kXJDWyw@mail.gmail.com>
<CAAKRu_Yt76_HdfR6DtK_wtkSNSj9=VxSV_npt+6T2R=zTzp1Pg@mail.gmail.com>
On Fri, Mar 27, 2026 at 3:17 PM Melanie Plageman
<[email protected]> wrote:
>
> Done in v47.
Attached v48 does a bit more cleanup. No functional changes. I'm
planning to push this soon. I think my remaining question is whether I
should move the row marks and result relation bitmaps into the EState.
I'm leaning toward not doing that and leaving them in the PlannedStmt.
Anyway, if I want to replace the list of result relation RTIs in the
PlannedStmt, I have to leave the bitmapset version there.
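
For anyone skimming the thread, here is a rough standalone sketch of why
the bitmap versions make the "is this relation modified by the query?"
check cheap. It is only an illustration under stated assumptions: ToyRelids
and the toy_* functions are hypothetical stand-ins, not PostgreSQL's
Bitmapset or bms_* API, and the sketch only handles range table indexes
below 64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a Bitmapset. NOT PostgreSQL's implementation. */
#define TOY_MAX_RTI 64

typedef struct ToyRelids
{
    uint64_t    bits;       /* bit N set => range table index N is a member */
} ToyRelids;

/* Add one range table index to the set (hypothetical helper). */
static void
toy_add_member(ToyRelids *set, unsigned rti)
{
    assert(rti < TOY_MAX_RTI);
    set->bits |= UINT64_C(1) << rti;
}

/* Constant-time membership test; no list walk is needed. */
static bool
toy_is_member(const ToyRelids *set, unsigned rti)
{
    return rti < TOY_MAX_RTI && ((set->bits >> rti) & 1) != 0;
}

/*
 * Hypothetical check: a relation counts as modified if it is a result
 * relation or carries a row mark. The exceptions mentioned in the commit
 * message are deliberately ignored here.
 */
static bool
toy_rel_is_modified(const ToyRelids *result_rels,
                    const ToyRelids *rowmark_rels,
                    unsigned rti)
{
    return toy_is_member(result_rels, rti) ||
        toy_is_member(rowmark_rels, rti);
}
```

With RTI 1 added to the result set and RTI 3 added to the row mark set,
toy_rel_is_modified() returns true for 1 and 3 and false for 2, each with
a couple of bit operations rather than a list scan.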
- Melanie
Attachments:
[text/x-patch] v48-0001-Make-it-cheap-to-check-if-a-relation-is-modified.patch (4.4K, 2-v48-0001-Make-it-cheap-to-check-if-a-relation-is-modified.patch)
From 04d24039ec7c14672955aaaba37e3aa512858a0d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 27 Mar 2026 09:21:22 -0400
Subject: [PATCH v48 1/6] Make it cheap to check if a relation is modified by a
query
Save the range table indexes of result relations and row mark relations
in separate bitmaps in the PlannedStmt. Precomputing them allows cheap
membership checks during execution. With a few exceptions, these two
groups comprise all relations that will be modified by a query. This
includes relations targeted by INSERT, UPDATE, DELETE, and MERGE as well
as relations with any row mark (like SELECT FOR UPDATE).
A later commit will use this information during scans to control whether
or not on-access pruning is allowed to set the visibility map -- which
would be counterproductive if the query will modify the page.
PlannedStmt->resultRelations is only used in a membership check, so it
may make sense to replace its usage with the new resultRelationRelids.
Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: David Rowley <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com
---
src/backend/executor/execParallel.c | 2 ++
src/backend/optimizer/plan/planner.c | 19 ++++++++++++++++++-
src/include/nodes/plannodes.h | 9 +++++++++
3 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ac84af294c9..791fcb88de9 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -188,6 +188,8 @@ ExecSerializePlan(Plan *plan, EState *estate)
pstmt->partPruneInfos = estate->es_part_prune_infos;
pstmt->rtable = estate->es_range_table;
pstmt->unprunableRelids = estate->es_unpruned_relids;
+ pstmt->resultRelationRelids = estate->es_plannedstmt->resultRelationRelids;
+ pstmt->rowMarkRelids = estate->es_plannedstmt->rowMarkRelids;
pstmt->permInfos = estate->es_rteperminfos;
pstmt->resultRelations = NIL;
pstmt->appendRelations = NIL;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d19800ad6a5..df4c99fc3ff 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -340,8 +340,11 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
RelOptInfo *final_rel;
Path *best_path;
Plan *top_plan;
+ Bitmapset *resultRelationRelids = NULL;
+ Bitmapset *rowMarkRelids = NULL;
ListCell *lp,
- *lr;
+ *lr,
+ *lc;
/*
* Set up global state for this planner invocation. This data is needed
@@ -661,6 +664,20 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
result->subplans = glob->subplans;
result->rewindPlanIDs = glob->rewindPlanIDs;
result->rowMarks = glob->finalrowmarks;
+
+ /*
+ * Compute resultRelationRelids and rowMarkRelids from resultRelations and
+ * rowMarks for quick access.
+ */
+ foreach(lc, glob->resultRelations)
+ resultRelationRelids = bms_add_member(resultRelationRelids,
+ lfirst_int(lc));
+ foreach(lc, glob->finalrowmarks)
+ rowMarkRelids = bms_add_member(rowMarkRelids,
+ ((PlanRowMark *) lfirst(lc))->rti);
+ result->resultRelationRelids = resultRelationRelids;
+ result->rowMarkRelids = rowMarkRelids;
+
result->relationOids = glob->relationOids;
result->invalItems = glob->invalItems;
result->paramExecTypes = glob->paramExecTypes;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b6185825fcb..88be65d7bde 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -121,6 +121,9 @@ typedef struct PlannedStmt
/* integer list of RT indexes, or NIL */
List *resultRelations;
+ /* RT indexes of result relations targeted by INSERT/UPDATE/DELETE/MERGE */
+ Bitmapset *resultRelationRelids;
+
/* list of AppendRelInfo nodes */
List *appendRelations;
@@ -138,6 +141,12 @@ typedef struct PlannedStmt
/* a list of PlanRowMark's */
List *rowMarks;
+ /*
+ * RT indexes of relations with row marks. Useful for quick membership
+ * checks instead of iterating through rowMarks.
+ */
+ Bitmapset *rowMarkRelids;
+
/* OIDs of relations the plan depends on */
List *relationOids;
--
2.43.0
[text/x-patch] v48-0002-Remove-PlannedStmt-resultRelations-in-favor-of-r.patch (3.8K, 3-v48-0002-Remove-PlannedStmt-resultRelations-in-favor-of-r.patch)
From 7c331c575a377b40a1dd1142b23fa3a8692de38f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 27 Mar 2026 08:35:00 -0400
Subject: [PATCH v48 2/6] Remove PlannedStmt->resultRelations in favor of
resultRelationRelids
PlannedStmt->resultRelations was an integer list of range table indexes.
Now that we have a bitmapset, which offers cheap membership checks,
remove the list and update all consumers to use the bitmapset.
---
contrib/pg_overexplain/pg_overexplain.c | 5 +++--
src/backend/executor/execParallel.c | 1 -
src/backend/executor/execUtils.c | 2 +-
src/backend/optimizer/plan/planner.c | 1 -
src/include/nodes/plannodes.h | 4 ----
5 files changed, 4 insertions(+), 9 deletions(-)
diff --git a/contrib/pg_overexplain/pg_overexplain.c b/contrib/pg_overexplain/pg_overexplain.c
index c2b90493cc6..b4e90909289 100644
--- a/contrib/pg_overexplain/pg_overexplain.c
+++ b/contrib/pg_overexplain/pg_overexplain.c
@@ -780,8 +780,9 @@ overexplain_range_table(PlannedStmt *plannedstmt, ExplainState *es)
overexplain_bitmapset("Unprunable RTIs", plannedstmt->unprunableRelids,
es);
if (es->format != EXPLAIN_FORMAT_TEXT ||
- plannedstmt->resultRelations != NIL)
- overexplain_intlist("Result RTIs", plannedstmt->resultRelations, es);
+ !bms_is_empty(plannedstmt->resultRelationRelids))
+ overexplain_bitmapset("Result RTIs", plannedstmt->resultRelationRelids,
+ es);
/* Close group, we're all done */
ExplainCloseGroup("Range Table", "Range Table", false, es);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 791fcb88de9..1bab6160036 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -191,7 +191,6 @@ ExecSerializePlan(Plan *plan, EState *estate)
pstmt->resultRelationRelids = estate->es_plannedstmt->resultRelationRelids;
pstmt->rowMarkRelids = estate->es_plannedstmt->rowMarkRelids;
pstmt->permInfos = estate->es_rteperminfos;
- pstmt->resultRelations = NIL;
pstmt->appendRelations = NIL;
pstmt->planOrigin = PLAN_STMT_INTERNAL;
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 9886ab06b69..36c5285d252 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -733,7 +733,7 @@ ExecCreateScanSlotFromOuterPlan(EState *estate,
bool
ExecRelationIsTargetRelation(EState *estate, Index scanrelid)
{
- return list_member_int(estate->es_plannedstmt->resultRelations, scanrelid);
+ return bms_is_member(scanrelid, estate->es_plannedstmt->resultRelationRelids);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index df4c99fc3ff..9853443209d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -659,7 +659,6 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
glob->prunableRelids);
result->permInfos = glob->finalrteperminfos;
result->subrtinfos = glob->subrtinfos;
- result->resultRelations = glob->resultRelations;
result->appendRelations = glob->appendRelations;
result->subplans = glob->subplans;
result->rewindPlanIDs = glob->rewindPlanIDs;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 88be65d7bde..19e5d814c59 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -117,10 +117,6 @@ typedef struct PlannedStmt
*/
List *permInfos;
- /* rtable indexes of target relations for INSERT/UPDATE/DELETE/MERGE */
- /* integer list of RT indexes, or NIL */
- List *resultRelations;
-
/* RT indexes of result relations targeted by INSERT/UPDATE/DELETE/MERGE */
Bitmapset *resultRelationRelids;
--
2.43.0
[text/x-patch] v48-0003-Thread-flags-through-begin-scan-APIs.patch (37.1K, 4-v48-0003-Thread-flags-through-begin-scan-APIs.patch)
From 05cc37abae70327fda4bee4a392dfebcc08ec3c5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Thu, 19 Mar 2026 17:05:55 -0400
Subject: [PATCH v48 3/6] Thread flags through begin-scan APIs
Add an AM user-settable flags parameter to several of the table
scan functions, one table AM callback, and index_beginscan(). This
allows users to pass additional context to be used when building the
scan descriptors.
For index scans, a new flags field is added to IndexFetchTableData, and
the heap AM saves the caller-provided flags there.
This introduces an extension point for follow-up work to pass
per-scan information (such as whether the relation is read-only for the
current query) from the executor to the AM layer.
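
As a rough illustration of the split between AM-internal flags and
caller-supplied flags, here is a minimal standalone sketch. It is not the
actual tableam code: the TOY_* names, their bit values, and
toy_beginscan_common() are all hypothetical, and OR-ing the two masks
together is an assumption about how a common begin-scan routine might
combine them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define TOY_SO_NONE             0u
#define TOY_SO_TYPE_SEQSCAN     (1u << 0)
#define TOY_SO_ALLOW_PAGEMODE   (1u << 1)
#define TOY_SO_TEMP_SNAPSHOT    (1u << 2)
#define TOY_SO_USER_HINT        (1u << 8)   /* made-up caller-settable bit */

/* Bits that only the scan setup code itself may set. */
#define TOY_SO_INTERNAL_FLAGS \
    (TOY_SO_TYPE_SEQSCAN | TOY_SO_ALLOW_PAGEMODE | TOY_SO_TEMP_SNAPSHOT)

/*
 * Combine internal and caller-provided flags into one mask. Callers must
 * not pass internal bits, and the internal mask must contain nothing else.
 */
static uint32_t
toy_beginscan_common(uint32_t internal_flags, uint32_t user_flags)
{
    assert((user_flags & TOY_SO_INTERNAL_FLAGS) == 0);
    assert((internal_flags & ~TOY_SO_INTERNAL_FLAGS) == 0);
    return internal_flags | user_flags;
}
```

A caller with nothing extra to say passes TOY_SO_NONE, which mirrors how
every existing call site in this patch passes SO_NONE.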
Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Reviewed-by: David Rowley <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com
---
contrib/pgrowlocks/pgrowlocks.c | 2 +-
src/backend/access/brin/brin.c | 3 +-
src/backend/access/gin/gininsert.c | 3 +-
src/backend/access/heap/heapam_handler.c | 9 +-
src/backend/access/index/genam.c | 6 +-
src/backend/access/index/indexam.c | 13 ++-
src/backend/access/nbtree/nbtsort.c | 3 +-
src/backend/access/table/tableam.c | 22 ++---
src/backend/commands/constraint.c | 3 +-
src/backend/commands/copyto.c | 3 +-
src/backend/commands/tablecmds.c | 13 +--
src/backend/commands/typecmds.c | 6 +-
src/backend/executor/execIndexing.c | 4 +-
src/backend/executor/execReplication.c | 12 ++-
src/backend/executor/nodeBitmapHeapscan.c | 3 +-
src/backend/executor/nodeIndexonlyscan.c | 9 +-
src/backend/executor/nodeIndexscan.c | 12 ++-
src/backend/executor/nodeSamplescan.c | 3 +-
src/backend/executor/nodeSeqscan.c | 9 +-
src/backend/executor/nodeTidrangescan.c | 7 +-
src/backend/partitioning/partbounds.c | 3 +-
src/backend/utils/adt/selfuncs.c | 3 +-
src/include/access/genam.h | 6 +-
src/include/access/heapam.h | 5 +-
src/include/access/relscan.h | 6 ++
src/include/access/tableam.h | 103 ++++++++++++++++------
26 files changed, 185 insertions(+), 86 deletions(-)
diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c
index ff3692c87c4..d164c4c03ad 100644
--- a/contrib/pgrowlocks/pgrowlocks.c
+++ b/contrib/pgrowlocks/pgrowlocks.c
@@ -115,7 +115,7 @@ pgrowlocks(PG_FUNCTION_ARGS)
RelationGetRelationName(rel));
/* Scan the relation */
- scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
+ scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL, SO_NONE);
hscan = (HeapScanDesc) scan;
attinmeta = TupleDescGetAttInMetadata(rsinfo->setDesc);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2a0f8c8e3b8..bdb30752e09 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2844,7 +2844,8 @@ _brin_parallel_scan_and_build(BrinBuildState *state,
indexInfo->ii_Concurrent = brinshared->isconcurrent;
scan = table_beginscan_parallel(heap,
- ParallelTableScanFromBrinShared(brinshared));
+ ParallelTableScanFromBrinShared(brinshared),
+ SO_NONE);
reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
brinbuildCallbackParallel, state, scan);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e54782d9dd8..9d83a495775 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2068,7 +2068,8 @@ _gin_parallel_scan_and_build(GinBuildState *state,
indexInfo->ii_Concurrent = ginshared->isconcurrent;
scan = table_beginscan_parallel(heap,
- ParallelTableScanFromGinBuildShared(ginshared));
+ ParallelTableScanFromGinBuildShared(ginshared),
+ SO_NONE);
reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
ginBuildCallbackParallel, state, scan);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d40878928e1..99280cd8159 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -81,11 +81,12 @@ heapam_slot_callbacks(Relation relation)
*/
static IndexFetchTableData *
-heapam_index_fetch_begin(Relation rel)
+heapam_index_fetch_begin(Relation rel, uint32 flags)
{
IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);
hscan->xs_base.rel = rel;
+ hscan->xs_base.flags = flags;
hscan->xs_cbuf = InvalidBuffer;
hscan->xs_vmbuffer = InvalidBuffer;
@@ -763,7 +764,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0,
+ SO_NONE);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
@@ -772,7 +774,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
- tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+ tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL,
+ SO_NONE);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 5e89b86a62c..1408989c568 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -455,7 +455,8 @@ systable_beginscan(Relation heapRelation,
}
sysscan->iscan = index_beginscan(heapRelation, irel,
- snapshot, NULL, nkeys, 0);
+ snapshot, NULL, nkeys, 0,
+ SO_NONE);
index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
sysscan->scan = NULL;
@@ -716,7 +717,8 @@ systable_beginscan_ordered(Relation heapRelation,
bsysscan = true;
sysscan->iscan = index_beginscan(heapRelation, indexRelation,
- snapshot, NULL, nkeys, 0);
+ snapshot, NULL, nkeys, 0,
+ SO_NONE);
index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index fbfc33159eb..44496ae0963 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -258,7 +258,8 @@ index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
IndexScanInstrumentation *instrument,
- int nkeys, int norderbys)
+ int nkeys, int norderbys,
+ uint32 flags)
{
IndexScanDesc scan;
@@ -285,7 +286,7 @@ index_beginscan(Relation heapRelation,
scan->instrument = instrument;
/* prepare to fetch index matches from table */
- scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+ scan->xs_heapfetch = table_index_fetch_begin(heapRelation, flags);
return scan;
}
@@ -588,13 +589,17 @@ index_parallelrescan(IndexScanDesc scan)
/*
* index_beginscan_parallel - join parallel index scan
*
+ * flags is a bitmask of ScanOptions affecting the underlying table scan. No
+ * SO_INTERNAL_FLAGS are permitted.
+ *
* Caller must be holding suitable locks on the heap and the index.
*/
IndexScanDesc
index_beginscan_parallel(Relation heaprel, Relation indexrel,
IndexScanInstrumentation *instrument,
int nkeys, int norderbys,
- ParallelIndexScanDesc pscan)
+ ParallelIndexScanDesc pscan,
+ uint32 flags)
{
Snapshot snapshot;
IndexScanDesc scan;
@@ -616,7 +621,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
scan->instrument = instrument;
/* prepare to fetch index matches from table */
- scan->xs_heapfetch = table_index_fetch_begin(heaprel);
+ scan->xs_heapfetch = table_index_fetch_begin(heaprel, flags);
return scan;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 47a9bda30c9..756dfa3dcf4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1928,7 +1928,8 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
indexInfo = BuildIndexInfo(btspool->index);
indexInfo->ii_Concurrent = btshared->isconcurrent;
scan = table_beginscan_parallel(btspool->heap,
- ParallelTableScanFromBTShared(btshared));
+ ParallelTableScanFromBTShared(btshared),
+ SO_NONE);
reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
true, progress, _bt_build_callback,
&buildstate, scan);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index dfda1af412e..86481d7c029 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -118,7 +118,7 @@ table_beginscan_catalog(Relation relation, int nkeys, ScanKeyData *key)
Snapshot snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
return table_beginscan_common(relation, snapshot, nkeys, key,
- NULL, flags);
+ NULL, flags, SO_NONE);
}
@@ -163,10 +163,11 @@ table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
}
TableScanDesc
-table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
+table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan,
+ uint32 flags)
{
Snapshot snapshot;
- uint32 flags = SO_TYPE_SEQSCAN |
+ uint32 internal_flags = SO_TYPE_SEQSCAN |
SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
@@ -176,7 +177,7 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
/* Snapshot was serialized -- restore it */
snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
RegisterSnapshot(snapshot);
- flags |= SO_TEMP_SNAPSHOT;
+ internal_flags |= SO_TEMP_SNAPSHOT;
}
else
{
@@ -185,16 +186,17 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
}
return table_beginscan_common(relation, snapshot, 0, NULL,
- pscan, flags);
+ pscan, internal_flags, flags);
}
TableScanDesc
table_beginscan_parallel_tidrange(Relation relation,
- ParallelTableScanDesc pscan)
+ ParallelTableScanDesc pscan,
+ uint32 flags)
{
Snapshot snapshot;
- uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
TableScanDesc sscan;
+ uint32 internal_flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
@@ -206,7 +208,7 @@ table_beginscan_parallel_tidrange(Relation relation,
/* Snapshot was serialized -- restore it */
snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
RegisterSnapshot(snapshot);
- flags |= SO_TEMP_SNAPSHOT;
+ internal_flags |= SO_TEMP_SNAPSHOT;
}
else
{
@@ -215,7 +217,7 @@ table_beginscan_parallel_tidrange(Relation relation,
}
sscan = table_beginscan_common(relation, snapshot, 0, NULL,
- pscan, flags);
+ pscan, internal_flags, flags);
return sscan;
}
@@ -248,7 +250,7 @@ table_index_fetch_tuple_check(Relation rel,
bool found;
slot = table_slot_create(rel, NULL);
- scan = table_index_fetch_begin(rel);
+ scan = table_index_fetch_begin(rel, SO_NONE);
found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again,
all_dead);
table_index_fetch_end(scan);
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index cc11c47b6f2..421d8c359f0 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -106,7 +106,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
*/
tmptid = checktid;
{
- IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation);
+ IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation,
+ SO_NONE);
bool call_again = false;
if (!table_index_fetch_tuple(scan, &tmptid, SnapshotSelf, slot,
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index faf62d959b4..f0e0147c665 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -1336,7 +1336,8 @@ CopyRelationTo(CopyToState cstate, Relation rel, Relation root_rel, uint64 *proc
AttrMap *map = NULL;
TupleTableSlot *root_slot = NULL;
- scandesc = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
+ scandesc = table_beginscan(rel, GetActiveSnapshot(), 0, NULL,
+ SO_NONE);
slot = table_slot_create(rel, NULL);
/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c69c12dc014..ec0063287d0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6411,7 +6411,8 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap)
* checking all the constraints.
*/
snapshot = RegisterSnapshot(GetLatestSnapshot());
- scan = table_beginscan(oldrel, snapshot, 0, NULL);
+ scan = table_beginscan(oldrel, snapshot, 0, NULL,
+ SO_NONE);
/*
* Switch to per-tuple memory context and reset it for each tuple
@@ -13980,8 +13981,8 @@ validateForeignKeyConstraint(char *conname,
*/
snapshot = RegisterSnapshot(GetLatestSnapshot());
slot = table_slot_create(rel, NULL);
- scan = table_beginscan(rel, snapshot, 0, NULL);
-
+ scan = table_beginscan(rel, snapshot, 0, NULL,
+ SO_NONE);
perTupCxt = AllocSetContextCreate(CurrentMemoryContext,
"validateForeignKeyConstraint",
ALLOCSET_SMALL_SIZES);
@@ -22882,7 +22883,8 @@ MergePartitionsMoveRows(List **wqueue, List *mergingPartitions, Relation newPart
/* Scan through the rows. */
snapshot = RegisterSnapshot(GetLatestSnapshot());
- scan = table_beginscan(mergingPartition, snapshot, 0, NULL);
+ scan = table_beginscan(mergingPartition, snapshot, 0, NULL,
+ SO_NONE);
/*
* Switch to per-tuple memory context and reset it for each tuple
@@ -23346,7 +23348,8 @@ SplitPartitionMoveRows(List **wqueue, Relation rel, Relation splitRel,
/* Scan through the rows. */
snapshot = RegisterSnapshot(GetLatestSnapshot());
- scan = table_beginscan(splitRel, snapshot, 0, NULL);
+ scan = table_beginscan(splitRel, snapshot, 0, NULL,
+ SO_NONE);
/*
* Switch to per-tuple memory context and reset it for each tuple
diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index 3dab6bb5a79..cd38e9cddf4 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -3185,7 +3185,8 @@ validateDomainNotNullConstraint(Oid domainoid)
/* Scan all tuples in this relation */
snapshot = RegisterSnapshot(GetLatestSnapshot());
- scan = table_beginscan(testrel, snapshot, 0, NULL);
+ scan = table_beginscan(testrel, snapshot, 0, NULL,
+ SO_NONE);
slot = table_slot_create(testrel, NULL);
while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
@@ -3266,7 +3267,8 @@ validateDomainCheckConstraint(Oid domainoid, const char *ccbin, LOCKMODE lockmod
/* Scan all tuples in this relation */
snapshot = RegisterSnapshot(GetLatestSnapshot());
- scan = table_beginscan(testrel, snapshot, 0, NULL);
+ scan = table_beginscan(testrel, snapshot, 0, NULL,
+ SO_NONE);
slot = table_slot_create(testrel, NULL);
while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9d071e495c6..cc6eb3a6ee9 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -815,7 +815,9 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
retry:
conflict = false;
found_self = false;
- index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, indnkeyatts, 0);
+ index_scan = index_beginscan(heap, index,
+ &DirtySnapshot, NULL, indnkeyatts, 0,
+ SO_NONE);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 2497ee7edc5..fea8991cb04 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -205,7 +205,8 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
/* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+ scan = index_beginscan(rel, idxrel,
+ &snap, NULL, skey_attoff, 0, SO_NONE);
retry:
found = false;
@@ -383,7 +384,8 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
/* Start a heap scan. */
InitDirtySnapshot(snap);
- scan = table_beginscan(rel, &snap, 0, NULL);
+ scan = table_beginscan(rel, &snap, 0, NULL,
+ SO_NONE);
scanslot = table_slot_create(rel, NULL);
retry:
@@ -602,7 +604,8 @@ RelationFindDeletedTupleInfoSeq(Relation rel, TupleTableSlot *searchslot,
* not yet committed or those just committed prior to the scan are
* excluded in update_most_recent_deletion_info().
*/
- scan = table_beginscan(rel, SnapshotAny, 0, NULL);
+ scan = table_beginscan(rel, SnapshotAny, 0, NULL,
+ SO_NONE);
scanslot = table_slot_create(rel, NULL);
table_rescan(scan, NULL);
@@ -666,7 +669,8 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid,
* not yet committed or those just committed prior to the scan are
* excluded in update_most_recent_deletion_info().
*/
- scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
+ scan = index_beginscan(rel, idxrel,
+ SnapshotAny, NULL, skey_attoff, 0, SO_NONE);
index_rescan(scan, skey, skey_attoff, NULL, 0);
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 7cf8d23c742..69683d81527 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -148,7 +148,8 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
table_beginscan_bm(node->ss.ss_currentRelation,
node->ss.ps.state->es_snapshot,
0,
- NULL);
+ NULL,
+ SO_NONE);
}
node->ss.ss_currentScanDesc->st.rs_tbmiterator = tbmiterator;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9eab81fd1c8..02df40f32c5 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -95,7 +95,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate->es_snapshot,
node->ioss_Instrument,
node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ node->ioss_NumOrderByKeys,
+ SO_NONE);
node->ioss_ScanDesc = scandesc;
@@ -794,7 +795,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_Instrument,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ SO_NONE);
node->ioss_ScanDesc->xs_want_itup = true;
node->ioss_VMBuffer = InvalidBuffer;
@@ -860,7 +862,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_Instrument,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ SO_NONE);
node->ioss_ScanDesc->xs_want_itup = true;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 06143e94c5a..3c0b8daf664 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -113,7 +113,8 @@ IndexNext(IndexScanState *node)
estate->es_snapshot,
node->iss_Instrument,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ SO_NONE);
node->iss_ScanDesc = scandesc;
@@ -209,7 +210,8 @@ IndexNextWithReorder(IndexScanState *node)
estate->es_snapshot,
node->iss_Instrument,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ SO_NONE);
node->iss_ScanDesc = scandesc;
@@ -1730,7 +1732,8 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_Instrument,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ SO_NONE);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1794,7 +1797,8 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_Instrument,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ SO_NONE);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index 6b0d65f752f..cf32df33d82 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -298,7 +298,8 @@ tablesample_init(SampleScanState *scanstate)
0, NULL,
scanstate->use_bulkread,
allow_sync,
- scanstate->use_pagemode);
+ scanstate->use_pagemode,
+ SO_NONE);
}
else
{
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 8f219f60a93..09ccc65de1c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -71,7 +71,8 @@ SeqNext(SeqScanState *node)
*/
scandesc = table_beginscan(node->ss.ss_currentRelation,
estate->es_snapshot,
- 0, NULL);
+ 0, NULL,
+ SO_NONE);
node->ss.ss_currentScanDesc = scandesc;
}
@@ -375,7 +376,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
estate->es_snapshot);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
- table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+ table_beginscan_parallel(node->ss.ss_currentRelation, pscan,
+ SO_NONE);
}
/* ----------------------------------------------------------------
@@ -408,5 +410,6 @@ ExecSeqScanInitializeWorker(SeqScanState *node,
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->ss.ss_currentScanDesc =
- table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+ table_beginscan_parallel(node->ss.ss_currentRelation, pscan,
+ SO_NONE);
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 617713bde04..084e4c6ec90 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -245,7 +245,8 @@ TidRangeNext(TidRangeScanState *node)
scandesc = table_beginscan_tidrange(node->ss.ss_currentRelation,
estate->es_snapshot,
&node->trss_mintid,
- &node->trss_maxtid);
+ &node->trss_maxtid,
+ SO_NONE);
node->ss.ss_currentScanDesc = scandesc;
}
else
@@ -460,7 +461,7 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
- pscan);
+ pscan, SO_NONE);
}
/* ----------------------------------------------------------------
@@ -494,5 +495,5 @@ ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->ss.ss_currentScanDesc =
table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
- pscan);
+ pscan, SO_NONE);
}
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 0ca312ac27d..f867d1b75a5 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -3362,7 +3362,8 @@ check_default_partition_contents(Relation parent, Relation default_rel,
econtext = GetPerTupleExprContext(estate);
snapshot = RegisterSnapshot(GetLatestSnapshot());
tupslot = table_slot_create(part_rel, &estate->es_tupleTable);
- scan = table_beginscan(part_rel, snapshot, 0, NULL);
+ scan = table_beginscan(part_rel, snapshot, 0, NULL,
+ SO_NONE);
/*
* Switch to per-tuple memory context and reset it for each tuple
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 53f85ccde01..4160d2d6e24 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7178,7 +7178,8 @@ get_actual_variable_endpoint(Relation heapRel,
index_scan = index_beginscan(heapRel, indexRel,
&SnapshotNonVacuumable, NULL,
- 1, 0);
+ 1, 0,
+ SO_NONE);
/* Set it up for index-only scan */
index_scan->xs_want_itup = true;
index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1a27bf060b3..b69320a7fc8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -158,7 +158,8 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
IndexScanInstrumentation *instrument,
- int nkeys, int norderbys);
+ int nkeys, int norderbys,
+ uint32 flags);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
IndexScanInstrumentation *instrument,
@@ -184,7 +185,8 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
Relation indexrel,
IndexScanInstrumentation *instrument,
int nkeys, int norderbys,
- ParallelIndexScanDesc pscan);
+ ParallelIndexScanDesc pscan,
+ uint32 flags);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
extern bool index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9b403203006..e2e07348f37 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -95,10 +95,7 @@ typedef struct HeapScanDescData
*/
ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
- /*
- * For sequential scans and bitmap heap scans. The current heap block's
- * corresponding page in the visibility map.
- */
+ /* Current heap block's corresponding page in the visibility map */
Buffer rs_vmbuffer;
/* these fields only used in page-at-a-time mode and for bitmap scans */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce340c076f8..960abf6c214 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -122,6 +122,12 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
typedef struct IndexFetchTableData
{
Relation rel;
+
+ /*
+ * Bitmask of ScanOptions affecting the relation. No SO_INTERNAL_FLAGS are
+ * permitted.
+ */
+ uint32 flags;
} IndexFetchTableData;
struct IndexScanInstrumentation;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06084752245..f8d1423b2d0 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -45,6 +45,8 @@ typedef struct ValidateIndexState ValidateIndexState;
*/
typedef enum ScanOptions
{
+ SO_NONE = 0,
+
/* one of SO_TYPE_* may be specified */
SO_TYPE_SEQSCAN = 1 << 0,
SO_TYPE_BITMAPSCAN = 1 << 1,
@@ -65,6 +67,19 @@ typedef enum ScanOptions
SO_TEMP_SNAPSHOT = 1 << 9,
} ScanOptions;
+/*
+ * Mask of flags that are set internally by the table scan functions and
+ * shouldn't be passed by callers. Some of these are effectively set by callers
+ * through parameters to table scan functions (e.g. SO_ALLOW_STRAT via
+ * allow_strat). For now, however, we retain tight control over them and don't
+ * allow callers to pass them directly to table scan functions.
+ */
+#define SO_INTERNAL_FLAGS \
+ (SO_TYPE_SEQSCAN | SO_TYPE_BITMAPSCAN | SO_TYPE_SAMPLESCAN | \
+ SO_TYPE_TIDSCAN | SO_TYPE_TIDRANGESCAN | SO_TYPE_ANALYZE | \
+ SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE | \
+ SO_TEMP_SNAPSHOT)
+
/*
* Result codes for table_{update,delete,lock_tuple}, and for visibility
* routines inside table AMs.
@@ -321,8 +336,9 @@ typedef struct TableAmRoutine
* `flags` is a bitmask indicating the type of scan (ScanOptions's
* SO_TYPE_*, currently only one may be specified), options controlling
* the scan's behaviour (ScanOptions's SO_ALLOW_*, several may be
- * specified, an AM may ignore unsupported ones) and whether the snapshot
- * needs to be deallocated at scan_end (ScanOptions's SO_TEMP_SNAPSHOT).
+ * specified, an AM may ignore unsupported ones), whether the snapshot
+ * needs to be deallocated at scan_end (ScanOptions's SO_TEMP_SNAPSHOT),
+ * and any number of the other ScanOptions values.
*/
TableScanDesc (*scan_begin) (Relation rel,
Snapshot snapshot,
@@ -418,9 +434,12 @@ typedef struct TableAmRoutine
* IndexFetchTableData, which the AM will typically embed in a larger
* structure with additional information.
*
+ * flags is a bitmask of ScanOptions affecting underlying table scan
+ * behavior. See scan_begin() for more information on passing these.
+ *
* Tuples for an index scan can then be fetched via index_fetch_tuple.
*/
- struct IndexFetchTableData *(*index_fetch_begin) (Relation rel);
+ struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, uint32 flags);
/*
* Reset index fetch. Typically this will release cross index fetch
@@ -871,12 +890,19 @@ extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
* A wrapper around the Table Access Method scan_begin callback, to centralize
* error checking. All calls to ->scan_begin() should go through this
* function.
+ *
+ * The caller-provided user_flags are validated against SO_INTERNAL_FLAGS to
+ * catch callers that accidentally pass scan-type or other internal flags.
*/
static TableScanDesc
table_beginscan_common(Relation rel, Snapshot snapshot, int nkeys,
ScanKeyData *key, ParallelTableScanDesc pscan,
- uint32 flags)
+ uint32 flags, uint32 user_flags)
{
+ Assert((user_flags & SO_INTERNAL_FLAGS) == 0);
+ Assert((flags & ~SO_INTERNAL_FLAGS) == 0);
+ flags |= user_flags;
+
/*
* We don't allow scans to be started while CheckXidAlive is set, except
* via systable_beginscan() et al. See detailed comments in xact.c where
@@ -891,15 +917,18 @@ table_beginscan_common(Relation rel, Snapshot snapshot, int nkeys,
/*
* Start a scan of `rel`. Returned tuples pass a visibility test of
* `snapshot`, and if nkeys != 0, the results are filtered by those scan keys.
+ *
+ * flags is a bitmask of ScanOptions. No SO_INTERNAL_FLAGS are permitted.
*/
static inline TableScanDesc
table_beginscan(Relation rel, Snapshot snapshot,
- int nkeys, ScanKeyData *key)
+ int nkeys, ScanKeyData *key, uint32 flags)
{
- uint32 flags = SO_TYPE_SEQSCAN |
+ uint32 internal_flags = SO_TYPE_SEQSCAN |
SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
- return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+ return table_beginscan_common(rel, snapshot, nkeys, key, NULL,
+ internal_flags, flags);
}
/*
@@ -928,7 +957,8 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
if (allow_sync)
flags |= SO_ALLOW_SYNC;
- return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+ return table_beginscan_common(rel, snapshot, nkeys, key, NULL,
+ flags, SO_NONE);
}
/*
@@ -936,14 +966,17 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
* TableScanDesc for a bitmap heap scan. Although that scan technology is
* really quite unlike a standard seqscan, there is just enough commonality to
* make it worth using the same data structure.
+ *
+ * flags is a bitmask of ScanOptions. No SO_INTERNAL_FLAGS are permitted.
*/
static inline TableScanDesc
table_beginscan_bm(Relation rel, Snapshot snapshot,
- int nkeys, ScanKeyData *key)
+ int nkeys, ScanKeyData *key, uint32 flags)
{
- uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
+ uint32 internal_flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
- return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+ return table_beginscan_common(rel, snapshot, nkeys, key, NULL,
+ internal_flags, flags);
}
/*
@@ -952,23 +985,26 @@ table_beginscan_bm(Relation rel, Snapshot snapshot,
* using the same data structure although the behavior is rather different.
* In addition to the options offered by table_beginscan_strat, this call
* also allows control of whether page-mode visibility checking is used.
+ *
+ * flags is a bitmask of ScanOptions. No SO_INTERNAL_FLAGS are permitted.
*/
static inline TableScanDesc
table_beginscan_sampling(Relation rel, Snapshot snapshot,
int nkeys, ScanKeyData *key,
bool allow_strat, bool allow_sync,
- bool allow_pagemode)
+ bool allow_pagemode, uint32 flags)
{
- uint32 flags = SO_TYPE_SAMPLESCAN;
+ uint32 internal_flags = SO_TYPE_SAMPLESCAN;
if (allow_strat)
- flags |= SO_ALLOW_STRAT;
+ internal_flags |= SO_ALLOW_STRAT;
if (allow_sync)
- flags |= SO_ALLOW_SYNC;
+ internal_flags |= SO_ALLOW_SYNC;
if (allow_pagemode)
- flags |= SO_ALLOW_PAGEMODE;
+ internal_flags |= SO_ALLOW_PAGEMODE;
- return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+ return table_beginscan_common(rel, snapshot, nkeys, key, NULL,
+ internal_flags, flags);
}
/*
@@ -981,7 +1017,8 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
{
uint32 flags = SO_TYPE_TIDSCAN;
- return table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags);
+ return table_beginscan_common(rel, snapshot, 0, NULL, NULL,
+ flags, SO_NONE);
}
/*
@@ -994,7 +1031,8 @@ table_beginscan_analyze(Relation rel)
{
uint32 flags = SO_TYPE_ANALYZE;
- return table_beginscan_common(rel, NULL, 0, NULL, NULL, flags);
+ return table_beginscan_common(rel, NULL, 0, NULL, NULL,
+ flags, SO_NONE);
}
/*
@@ -1055,16 +1093,19 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
/*
* table_beginscan_tidrange is the entry point for setting up a TableScanDesc
* for a TID range scan.
+ *
+ * flags is a bitmask of ScanOptions. No SO_INTERNAL_FLAGS are permitted.
*/
static inline TableScanDesc
table_beginscan_tidrange(Relation rel, Snapshot snapshot,
ItemPointer mintid,
- ItemPointer maxtid)
+ ItemPointer maxtid, uint32 flags)
{
TableScanDesc sscan;
- uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ uint32 internal_flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
- sscan = table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags);
+ sscan = table_beginscan_common(rel, snapshot, 0, NULL, NULL,
+ internal_flags, flags);
/* Set the range of TIDs to scan */
sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
@@ -1136,20 +1177,26 @@ extern void table_parallelscan_initialize(Relation rel,
* table_parallelscan_initialize(), for the same relation. The initialization
* does not need to have happened in this backend.
*
+ * flags is a bitmask of ScanOptions. No SO_INTERNAL_FLAGS are permitted.
+ *
* Caller must hold a suitable lock on the relation.
*/
extern TableScanDesc table_beginscan_parallel(Relation relation,
- ParallelTableScanDesc pscan);
+ ParallelTableScanDesc pscan,
+ uint32 flags);
/*
* Begin a parallel tid range scan. `pscan` needs to have been initialized
* with table_parallelscan_initialize(), for the same relation. The
* initialization does not need to have happened in this backend.
*
+ * flags is a bitmask of ScanOptions. No SO_INTERNAL_FLAGS are permitted.
+ *
* Caller must hold a suitable lock on the relation.
*/
extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
- ParallelTableScanDesc pscan);
+ ParallelTableScanDesc pscan,
+ uint32 flags);
/*
* Restart a parallel scan. Call this in the leader process. Caller is
@@ -1172,11 +1219,15 @@ table_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
* Prepare to fetch tuples from the relation, as needed when fetching tuples
* for an index scan.
*
+ * flags is a bitmask of ScanOptions. No SO_INTERNAL_FLAGS are permitted.
+ *
* Tuples for an index scan can then be fetched via table_index_fetch_tuple().
*/
static inline IndexFetchTableData *
-table_index_fetch_begin(Relation rel)
+table_index_fetch_begin(Relation rel, uint32 flags)
{
+ Assert((flags & SO_INTERNAL_FLAGS) == 0);
+
/*
* We don't allow scans to be started while CheckXidAlive is set, except
* via systable_beginscan() et al. See detailed comments in xact.c where
@@ -1185,7 +1236,7 @@ table_index_fetch_begin(Relation rel)
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "scan started during logical decoding");
- return rel->rd_tableam->index_fetch_begin(rel);
+ return rel->rd_tableam->index_fetch_begin(rel, flags);
}
/*
--
2.43.0
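To make the internal/caller flag split in table_beginscan_common() easier to follow, here is a minimal standalone sketch of the same check. It is not the PostgreSQL code: the DEMO_* names, their bit positions, and both helper functions are hypothetical stand-ins, and only the shape of the two Asserts followed by the OR is taken from the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for a few ScanOptions bits (positions illustrative). */
#define DEMO_SO_NONE            ((uint32_t) 0)
#define DEMO_SO_TYPE_SEQSCAN    ((uint32_t) 1 << 0)
#define DEMO_SO_ALLOW_STRAT     ((uint32_t) 1 << 1)
#define DEMO_SO_CALLER_HINT     ((uint32_t) 1 << 2) /* a caller-passable option */

/* Flags only the scan-begin wrappers may set, analogous to SO_INTERNAL_FLAGS. */
#define DEMO_SO_INTERNAL_FLAGS  (DEMO_SO_TYPE_SEQSCAN | DEMO_SO_ALLOW_STRAT)

/*
 * True if the masks obey the split: callers pass no internal bits, and the
 * wrapper's own mask contains nothing but internal bits.
 */
static bool
demo_flags_are_valid(uint32_t internal_flags, uint32_t user_flags)
{
	return (user_flags & DEMO_SO_INTERNAL_FLAGS) == 0 &&
		(internal_flags & ~DEMO_SO_INTERNAL_FLAGS) == 0;
}

/* Validate both masks, then merge them into the mask handed to the AM. */
static uint32_t
demo_combine_flags(uint32_t internal_flags, uint32_t user_flags)
{
	assert(demo_flags_are_valid(internal_flags, user_flags));
	return internal_flags | user_flags;
}
```

Keeping the two masks as separate parameters until the last moment is what lets a single assertion catch a caller that passes, say, a scan-type bit by mistake.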
[text/x-patch] v48-0004-Pass-down-information-on-table-modification-to-s.patch (10.0K, 5-v48-0004-Pass-down-information-on-table-modification-to-s.patch)
From 239ec276e5bee0f59ae0a91d0bd9eff8842c8a63 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Mon, 2 Mar 2026 16:31:33 -0500
Subject: [PATCH v48 4/6] Pass down information on table modification to scan
node
Pass down information to sequential scan, index [only] scan, bitmap
table scan, sample scan, and TID range scan nodes on whether or not the
query modifies the relation being scanned. A later commit will use this
information to update the VM during on-access pruning only if the
relation is not modified by the query.
Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Andrey Borodin <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/4379FDA3-9446-4E2C-9C15-32EFE8D4F31B%40yandex-team.ru
---
src/backend/executor/execUtils.c | 21 +++++++++++++++++++++
src/backend/executor/nodeBitmapHeapscan.c | 3 ++-
src/backend/executor/nodeIndexonlyscan.c | 9 ++++++---
src/backend/executor/nodeIndexscan.c | 12 ++++++++----
src/backend/executor/nodeSamplescan.c | 3 ++-
src/backend/executor/nodeSeqscan.c | 10 +++++++---
src/backend/executor/nodeTidrangescan.c | 11 ++++++++---
src/include/access/tableam.h | 3 +++
src/include/executor/executor.h | 2 ++
9 files changed, 59 insertions(+), 15 deletions(-)
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 36c5285d252..f090de49921 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -736,6 +736,27 @@ ExecRelationIsTargetRelation(EState *estate, Index scanrelid)
return bms_is_member(scanrelid, estate->es_plannedstmt->resultRelationRelids);
}
+/*
+ * Return true if the scan node's relation is not modified by the query.
+ *
+ * This is not perfectly accurate. INSERT ... SELECT from the same table does
+ * not add the scan relation to resultRelationRelids, so it will be reported
+ * as read-only even though the query modifies it.
+ *
+ * Conversely, when any relation in the query has a modifying row mark, all
+ * other relations get a ROW_MARK_REFERENCE, causing them to be reported as
+ * not read-only even though they may only be read.
+ */
+bool
+ScanRelIsReadOnly(ScanState *ss)
+{
+ Index scanrelid = ((Scan *) ss->ps.plan)->scanrelid;
+ PlannedStmt *pstmt = ss->ps.state->es_plannedstmt;
+
+ return !bms_is_member(scanrelid, pstmt->resultRelationRelids) &&
+ !bms_is_member(scanrelid, pstmt->rowMarkRelids);
+}
+
/* ----------------------------------------------------------------
* ExecOpenScanRelation
*
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 69683d81527..73831aed451 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -149,7 +149,8 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
node->ss.ps.state->es_snapshot,
0,
NULL,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
}
node->ss.ss_currentScanDesc->st.rs_tbmiterator = tbmiterator;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 02df40f32c5..de6154fd541 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -96,7 +96,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_Instrument,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
node->ioss_ScanDesc = scandesc;
@@ -796,7 +797,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
piscan,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
node->ioss_ScanDesc->xs_want_itup = true;
node->ioss_VMBuffer = InvalidBuffer;
@@ -863,7 +865,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
piscan,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
node->ioss_ScanDesc->xs_want_itup = true;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3c0b8daf664..1620d146071 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -114,7 +114,8 @@ IndexNext(IndexScanState *node)
node->iss_Instrument,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
node->iss_ScanDesc = scandesc;
@@ -211,7 +212,8 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_Instrument,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
node->iss_ScanDesc = scandesc;
@@ -1733,7 +1735,8 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
piscan,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1798,7 +1801,8 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
piscan,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index cf32df33d82..f3d273e1c5e 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -299,7 +299,8 @@ tablesample_init(SampleScanState *scanstate)
scanstate->use_bulkread,
allow_sync,
scanstate->use_pagemode,
- SO_NONE);
+ ScanRelIsReadOnly(&scanstate->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
}
else
{
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 09ccc65de1c..04803b0e37d 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -72,7 +72,8 @@ SeqNext(SeqScanState *node)
scandesc = table_beginscan(node->ss.ss_currentRelation,
estate->es_snapshot,
0, NULL,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
node->ss.ss_currentScanDesc = scandesc;
}
@@ -375,9 +376,11 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
pscan,
estate->es_snapshot);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+
node->ss.ss_currentScanDesc =
table_beginscan_parallel(node->ss.ss_currentRelation, pscan,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
}
/* ----------------------------------------------------------------
@@ -411,5 +414,6 @@ ExecSeqScanInitializeWorker(SeqScanState *node,
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->ss.ss_currentScanDesc =
table_beginscan_parallel(node->ss.ss_currentRelation, pscan,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 084e4c6ec90..4a8fe91b2b3 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -246,7 +246,8 @@ TidRangeNext(TidRangeScanState *node)
estate->es_snapshot,
&node->trss_mintid,
&node->trss_maxtid,
- SO_NONE);
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
node->ss.ss_currentScanDesc = scandesc;
}
else
@@ -461,7 +462,9 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
- pscan, SO_NONE);
+ pscan,
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
}
/* ----------------------------------------------------------------
@@ -495,5 +498,7 @@ ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->ss.ss_currentScanDesc =
table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
- pscan, SO_NONE);
+ pscan,
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f8d1423b2d0..68ddabc171a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -65,6 +65,9 @@ typedef enum ScanOptions
/* unregister snapshot at scan end? */
SO_TEMP_SNAPSHOT = 1 << 9,
+
+ /* set if the query doesn't modify the relation */
+ SO_HINT_REL_READ_ONLY = 1 << 10,
} ScanOptions;
/*
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 07f4b1f7490..7979a17e4ec 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -690,6 +690,8 @@ extern void ExecCreateScanSlotFromOuterPlan(EState *estate,
extern bool ExecRelationIsTargetRelation(EState *estate, Index scanrelid);
+extern bool ScanRelIsReadOnly(ScanState *ss);
+
extern Relation ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags);
extern void ExecInitRangeTable(EState *estate, List *rangeTable, List *permInfos,
--
2.43.0
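The read-only test in ScanRelIsReadOnly() can be sketched outside the executor as follows. This is an illustration under stated assumptions, not the patch's code: DemoPlannedStmt, the 64-bit masks standing in for the resultRelationRelids and rowMarkRelids Bitmapsets, and every demo_* name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two scan options involved. */
#define DEMO_SO_NONE                ((uint32_t) 0)
#define DEMO_SO_HINT_REL_READ_ONLY  ((uint32_t) 1 << 10)

/* Bit i set means range-table index i is a member of the set. */
typedef struct DemoPlannedStmt
{
	uint64_t	result_rel_mask;	/* relations the query modifies */
	uint64_t	row_mark_mask;		/* relations carrying a row mark */
} DemoPlannedStmt;

static bool
demo_mask_has(uint64_t mask, unsigned relid)
{
	return relid < 64 && ((mask >> relid) & 1) != 0;
}

/* Read-only means: neither a result relation nor row-marked. */
static bool
demo_scan_rel_is_read_only(const DemoPlannedStmt *pstmt, unsigned scanrelid)
{
	return !demo_mask_has(pstmt->result_rel_mask, scanrelid) &&
		!demo_mask_has(pstmt->row_mark_mask, scanrelid);
}

/* The flag each scan node would pass to its scan-begin call. */
static uint32_t
demo_scan_flags(const DemoPlannedStmt *pstmt, unsigned scanrelid)
{
	return demo_scan_rel_is_read_only(pstmt, scanrelid) ?
		DEMO_SO_HINT_REL_READ_ONLY : DEMO_SO_NONE;
}
```

As the function comment in the patch notes, the result is only a hint: a row mark on one relation makes the others look modified, and INSERT ... SELECT from the same table looks read-only.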
[text/x-patch] v48-0005-Allow-on-access-pruning-to-set-pages-all-visible.patch (9.9K, 6-v48-0005-Allow-on-access-pruning-to-set-pages-all-visible.patch)
From e914b4834e613c59935df55a400a9290cc145b33 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 27 Feb 2026 16:33:40 -0500
Subject: [PATCH v48 5/6] Allow on-access pruning to set pages all-visible
Many queries do not modify the underlying relation. For such queries, if
on-access pruning occurs during the scan, we can check whether the page
has become all-visible and update the visibility map accordingly.
Previously, only vacuum and COPY FREEZE marked pages as all-visible or
all-frozen.
This commit implements on-access VM setting for sequential scans as well
as for the underlying heap relation in index scans and bitmap heap
scans.
Setting the visibility map on-access can avoid write amplification
caused by vacuum later needing to set the page all-visible, which could
trigger a write and potentially an FPI. It also allows more frequent
index-only scans, since they require pages to be marked all-visible in
the VM.
Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
---
src/backend/access/heap/heapam.c | 3 +-
src/backend/access/heap/heapam_handler.c | 6 ++-
src/backend/access/heap/pruneheap.c | 55 ++++++++++++++++++------
src/backend/access/heap/vacuumlazy.c | 2 +-
src/include/access/heapam.h | 3 +-
5 files changed, 52 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index eb1f67f31cd..7012ee2c306 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -633,7 +633,8 @@ heap_prepare_pagescan(TableScanDesc sscan)
/*
* Prune and repair fragmentation for the whole page, if possible.
*/
- heap_page_prune_opt(scan->rs_base.rs_rd, buffer, &scan->rs_vmbuffer);
+ heap_page_prune_opt(scan->rs_base.rs_rd, buffer, &scan->rs_vmbuffer,
+ (sscan->rs_flags & SO_HINT_REL_READ_ONLY));
/*
* We must hold share lock on the buffer content while examining tuple
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 99280cd8159..3433ea93c11 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -149,7 +149,8 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
*/
if (prev_buf != hscan->xs_cbuf)
heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf,
- &hscan->xs_vmbuffer);
+ &hscan->xs_vmbuffer,
+ (hscan->xs_base.flags & SO_HINT_REL_READ_ONLY));
}
/* Obtain share-lock on the buffer so we can examine visibility */
@@ -2546,7 +2547,8 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
/*
* Prune and repair fragmentation for the whole page, if possible.
*/
- heap_page_prune_opt(scan->rs_rd, buffer, &hscan->rs_vmbuffer);
+ heap_page_prune_opt(scan->rs_rd, buffer, &hscan->rs_vmbuffer,
+ scan->rs_flags & SO_HINT_REL_READ_ONLY);
/*
* We must hold share lock on the buffer content while examining tuple
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6693af8da7f..7fcfc844d20 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -44,6 +44,8 @@ typedef struct
bool mark_unused_now;
/* whether to attempt freezing tuples */
bool attempt_freeze;
+ /* whether to attempt setting the VM */
+ bool attempt_set_vm;
struct VacuumCutoffs *cutoffs;
Relation relation;
@@ -232,7 +234,8 @@ static void page_verify_redirects(Page page);
static bool heap_page_will_freeze(bool did_tuple_hint_fpi, bool do_prune, bool do_hint_prune,
PruneState *prstate);
-static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason);
+static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason,
+ bool do_prune, bool do_freeze);
/*
@@ -251,9 +254,21 @@ static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason);
* reuse the pin across calls, avoiding repeated pin/unpin cycles. If we find
* VM corruption during pruning, we will fix it. Caller is responsible for
* unpinning *vmbuffer.
+ *
+ * rel_read_only is true if we determined at plan time that the query does not
+ * modify the relation. It is counterproductive to set the VM if the query
+ * will immediately clear it.
+ *
+ * As noted in ScanRelIsReadOnly(), INSERT ... SELECT on the same table will
+ * report the scan relation as read-only. This is usually harmless in
+ * practice. It is useful to set scanned pages all-visible that won't be
+ * inserted into. Pages we do insert to rarely meet the criteria for pruning,
+ * and those that do are likely to contain in-progress inserts which make the
+ * page not fully all-visible.
*/
void
-heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
+heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer,
+ bool rel_read_only)
{
Page page = BufferGetPage(buffer);
TransactionId prune_xid;
@@ -336,6 +351,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
* current implementation.
*/
params.options = HEAP_PAGE_PRUNE_ALLOW_FAST_PATH;
+ if (rel_read_only)
+ params.options |= HEAP_PAGE_PRUNE_SET_VM;
heap_page_prune_and_freeze(¶ms, &presult, &dummy_off_loc,
NULL, NULL);
@@ -392,6 +409,7 @@ prune_freeze_setup(PruneFreezeParams *params,
/* cutoffs must be provided if we will attempt freezing */
Assert(!(params->options & HEAP_PAGE_PRUNE_FREEZE) || params->cutoffs);
prstate->attempt_freeze = (params->options & HEAP_PAGE_PRUNE_FREEZE) != 0;
+ prstate->attempt_set_vm = (params->options & HEAP_PAGE_PRUNE_SET_VM) != 0;
prstate->cutoffs = params->cutoffs;
prstate->relation = params->relation;
prstate->block = BufferGetBlockNumber(params->buffer);
@@ -461,9 +479,8 @@ prune_freeze_setup(PruneFreezeParams *params,
* We track whether the page will be all-visible/all-frozen at the end of
* pruning and freezing. While examining tuple visibility, we'll set
* set_all_visible to false if there are tuples on the page not visible to
- * all running and future transactions. set_all_visible is always
- * maintained but only VACUUM will set the VM if the page ends up being
- * all-visible.
+ * all running and future transactions. If enabled for this scan, we will
+ * set the VM if the page ends up being all-visible.
*
* We also keep track of the newest live XID, which is used to calculate
* the snapshot conflict horizon for a WAL record setting the VM.
@@ -920,21 +937,35 @@ heap_page_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum,
* This function does not actually set the VM bits or page-level visibility
* hint, PD_ALL_VISIBLE.
*
+ * This should be called only after do_freeze has been decided (and do_prune
+ * has been set), as these factor into our heuristic-based decision.
+ *
* Returns true if one or both VM bits should be set and false otherwise.
*/
static bool
-heap_page_will_set_vm(PruneState *prstate, PruneReason reason)
+heap_page_will_set_vm(PruneState *prstate, PruneReason reason,
+ bool do_prune, bool do_freeze)
{
- /*
- * Though on-access pruning maintains prstate->set_all_visible, we don't
- * set the VM on-access for now.
- */
- if (reason == PRUNE_ON_ACCESS)
+ if (!prstate->attempt_set_vm)
return false;
if (!prstate->set_all_visible)
return false;
+ /*
+ * If this is an on-access call and we're not actually pruning, avoid
+ * setting the visibility map if it would newly dirty the heap page or, if
+ * the page is already dirty, if doing so would require including a
+ * full-page image (FPI) of the heap page in the WAL.
+ */
+ if (reason == PRUNE_ON_ACCESS && !do_prune && !do_freeze &&
+ (!BufferIsDirty(prstate->buffer) || XLogCheckBufferNeedsBackup(prstate->buffer)))
+ {
+ prstate->set_all_visible = false;
+ prstate->set_all_frozen = false;
+ return false;
+ }
+
prstate->new_vmbits = VISIBILITYMAP_ALL_VISIBLE;
if (prstate->set_all_frozen)
@@ -1167,7 +1198,7 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
Assert(!prstate.set_all_frozen || prstate.set_all_visible);
Assert(!prstate.set_all_visible || (prstate.lpdead_items == 0));
- do_set_vm = heap_page_will_set_vm(&prstate, params->reason);
+ do_set_vm = heap_page_will_set_vm(&prstate, params->reason, do_prune, do_freeze);
/*
* new_vmbits should be 0 regardless of whether or not the page is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f698c2d899b..24001b27387 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2021,7 +2021,7 @@ lazy_scan_prune(LVRelState *vacrel,
.buffer = buf,
.vmbuffer = vmbuffer,
.reason = PRUNE_VACUUM_SCAN,
- .options = HEAP_PAGE_PRUNE_FREEZE,
+ .options = HEAP_PAGE_PRUNE_FREEZE | HEAP_PAGE_PRUNE_SET_VM,
.vistest = vacrel->vistest,
.cutoffs = &vacrel->cutoffs,
};
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e2e07348f37..f2a009141be 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -43,6 +43,7 @@
#define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW (1 << 0)
#define HEAP_PAGE_PRUNE_FREEZE (1 << 1)
#define HEAP_PAGE_PRUNE_ALLOW_FAST_PATH (1 << 2)
+#define HEAP_PAGE_PRUNE_SET_VM (1 << 3)
typedef struct BulkInsertStateData *BulkInsertState;
typedef struct GlobalVisState GlobalVisState;
@@ -431,7 +432,7 @@ extern TransactionId heap_index_delete_tuples(Relation rel,
/* in heap/pruneheap.c */
extern void heap_page_prune_opt(Relation relation, Buffer buffer,
- Buffer *vmbuffer);
+ Buffer *vmbuffer, bool rel_read_only);
extern void heap_page_prune_and_freeze(PruneFreezeParams *params,
PruneFreezeResult *presult,
OffsetNumber *off_loc,
--
2.43.0
[text/x-patch] v48-0006-Set-pd_prune_xid-on-insert.patch (8.6K, 7-v48-0006-Set-pd_prune_xid-on-insert.patch)
From 13f3c314d760bce33ca48ea6d1cde606b62cad4c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 29 Jul 2025 16:12:56 -0400
Subject: [PATCH v48 6/6] Set pd_prune_xid on insert
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Now that on-access pruning can update the visibility map (VM) during
read-only queries, set the page’s pd_prune_xid hint during INSERT and on
the new page during UPDATE.
This allows heap_page_prune_and_freeze() to set the VM the first time a
page is read after being filled with tuples. This may avoid I/O
amplification by setting the page all-visible when it is still in shared
buffers and allowing later vacuums to skip scanning the page. It also
enables index-only scans of newly inserted data much sooner.
As a side benefit, this addresses a long-standing note in heap_insert()
and heap_multi_insert(): aborted inserts can now be pruned on-access
rather than lingering until the next VACUUM.
Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
---
src/backend/access/heap/heapam.c | 39 +++++++++++++++++----------
src/backend/access/heap/heapam_xlog.c | 19 ++++++++++++-
src/backend/access/heap/pruneheap.c | 18 ++++++-------
3 files changed, 51 insertions(+), 25 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7012ee2c306..3b020d910d7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2154,6 +2154,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
+ Page page;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
@@ -2180,6 +2181,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
&vmbuffer, NULL,
0);
+ page = BufferGetPage(buffer);
+
/*
* We're about to do the actual insert -- but check for conflict first, to
* avoid possibly having to roll back work we've just done.
@@ -2203,25 +2206,30 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
RelationPutHeapTuple(relation, buffer, heaptup,
(options & HEAP_INSERT_SPECULATIVE) != 0);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(page))
{
all_visible_cleared = true;
- PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllVisible(page);
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer, VISIBILITYMAP_VALID_BITS);
}
/*
- * XXX Should we set PageSetPrunable on this page ?
+ * Set pd_prune_xid to trigger heap_page_prune_and_freeze() once the page
+ * is full so that we can set the page all-visible in the VM on the next
+ * page access.
*
- * The inserting transaction may eventually abort thus making this tuple
- * DEAD and hence available for pruning. Though we don't want to optimize
- * for aborts, if no other tuple in this page is UPDATEd/DELETEd, the
- * aborted tuple will never be pruned until next vacuum is triggered.
+ * Setting pd_prune_xid is also handy if the inserting transaction
+ * eventually aborts making this tuple DEAD and hence available for
+ * pruning. If no other tuple in this page is UPDATEd/DELETEd, the aborted
+ * tuple would never otherwise be pruned until next vacuum is triggered.
*
- * If you do add PageSetPrunable here, add it in heap_xlog_insert too.
+ * Don't set it if we are in bootstrap mode or we are inserting a frozen
+ * tuple, as there is no further pruning/freezing needed in those cases.
*/
+ if (TransactionIdIsNormal(xid) && !(options & HEAP_INSERT_FROZEN))
+ PageSetPrunable(page, xid);
MarkBufferDirty(buffer);
@@ -2231,7 +2239,6 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
xl_heap_insert xlrec;
xl_heap_header xlhdr;
XLogRecPtr recptr;
- Page page = BufferGetPage(buffer);
uint8 info = XLOG_HEAP_INSERT;
int bufflags = 0;
@@ -2596,8 +2603,12 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
}
/*
- * XXX Should we set PageSetPrunable on this page ? See heap_insert()
+ * Set pd_prune_xid. See heap_insert() for more on why we do this when
+ * inserting tuples. This only makes sense if we aren't already
+ * setting the page frozen in the VM and we're not in bootstrap mode.
*/
+ if (!all_frozen_set && TransactionIdIsNormal(xid))
+ PageSetPrunable(page, xid);
MarkBufferDirty(buffer);
@@ -4139,12 +4150,12 @@ l2:
* the subsequent page pruning will be a no-op and the hint will be
* cleared.
*
- * XXX Should we set hint on newbuf as well? If the transaction aborts,
- * there would be a prunable tuple in the newbuf; but for now we choose
- * not to optimize for aborts. Note that heap_xlog_update must be kept in
- * sync if this decision changes.
+ * We set the new page prunable as well. See heap_insert() for more on why
+ * we do this when inserting tuples.
*/
PageSetPrunable(page, xid);
+ if (newbuf != buffer)
+ PageSetPrunable(newpage, xid);
if (use_hot_update)
{
diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
index 1302bb13e18..f3f419d3dc1 100644
--- a/src/backend/access/heap/heapam_xlog.c
+++ b/src/backend/access/heap/heapam_xlog.c
@@ -450,6 +450,14 @@ heap_xlog_insert(XLogReaderState *record)
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
+ /*
+ * Set the page prunable to trigger on-access pruning later, which may
+ * set the page all-visible in the VM. See comments in heap_insert().
+ */
+ if (TransactionIdIsNormal(XLogRecGetXid(record)) &&
+ !HeapTupleHeaderXminFrozen(htup))
+ PageSetPrunable(page, XLogRecGetXid(record));
+
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
@@ -599,12 +607,19 @@ heap_xlog_multi_insert(XLogReaderState *record)
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
- /* XLH_INSERT_ALL_FROZEN_SET implies that all tuples are visible */
+ /*
+ * XLH_INSERT_ALL_FROZEN_SET implies that all tuples are visible. If
+ * we are not setting the page frozen, then set the page's prunable
+ * hint so that we trigger on-access pruning later which may set the
+ * page all-visible in the VM.
+ */
if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET)
{
PageSetAllVisible(page);
PageClearPrunable(page);
}
+ else
+ PageSetPrunable(page, XLogRecGetXid(record));
MarkBufferDirty(buffer);
}
@@ -921,6 +936,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
freespace = PageGetHeapFreeSpace(npage);
PageSetLSN(npage, lsn);
+ /* See heap_insert() for why we set pd_prune_xid on insert */
+ PageSetPrunable(npage, XLogRecGetXid(record));
MarkBufferDirty(nbuffer);
}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 7fcfc844d20..fe9b1f16db4 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -286,7 +286,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer,
/*
* First check whether there's any chance there's something to prune,
* determining the appropriate horizon is a waste if there's no prune_xid
- * (i.e. no updates/deletes left potentially dead tuples around).
+ * (i.e. no updates/deletes left potentially dead tuples around and no
+ * inserts inserted new tuples that may be visible to all).
*/
prune_xid = PageGetPruneXid(page);
if (!TransactionIdIsValid(prune_xid))
@@ -1927,17 +1928,14 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
prstate->set_all_visible = false;
prstate->set_all_frozen = false;
- /* The page should not be marked all-visible */
- if (PageIsAllVisible(page))
- heap_page_fix_vm_corruption(prstate, offnum,
- VM_CORRUPT_TUPLE_VISIBILITY);
-
/*
- * If we wanted to optimize for aborts, we might consider marking
- * the page prunable when we see INSERT_IN_PROGRESS. But we
- * don't. See related decisions about when to mark the page
- * prunable in heapam.c.
+ * Though there is nothing "prunable" on the page, we maintain
+ * pd_prune_xid for inserts so that we have the opportunity to
+ * mark them all-visible during the next round of pruning.
*/
+ heap_prune_record_prunable(prstate,
+ HeapTupleHeaderGetXmin(htup),
+ offnum);
break;
case HEAPTUPLE_DELETE_IN_PROGRESS:
--
2.43.0
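The effect of the 0006 patch on the pd_prune_xid hint can be illustrated with a toy model. Everything below is an assumption made for illustration: ModelPage, model_set_prunable(), model_insert() and model_prune_considers_page() are hypothetical names, the page is reduced to the single hint field, and the xid comparison only handles normal xids. It is meant to show the guard conditions from heap_insert() and the early exit in heap_page_prune_opt(), not to reproduce PostgreSQL's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

/* Toy page: only the prune hint is modeled. */
typedef struct ModelPage
{
	TransactionId pd_prune_xid;
} ModelPage;

static bool
xid_is_normal(TransactionId xid)
{
	return xid >= FirstNormalTransactionId;
}

/* Wraparound-aware "a is older than b"; valid for normal xids only. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/* Keep the oldest xid seen as the hint. */
static void
model_set_prunable(ModelPage *page, TransactionId xid)
{
	assert(xid_is_normal(xid));
	if (page->pd_prune_xid == InvalidTransactionId ||
		xid_precedes(xid, page->pd_prune_xid))
		page->pd_prune_xid = xid;
}

/* Guard from the patched heap_insert(): skip bootstrap xids and frozen inserts. */
static void
model_insert(ModelPage *page, TransactionId xid, bool insert_frozen)
{
	if (xid_is_normal(xid) && !insert_frozen)
		model_set_prunable(page, xid);
}

/* Early exit modeled on heap_page_prune_opt(): no hint, nothing to do. */
static bool
model_prune_considers_page(const ModelPage *page)
{
	return page->pd_prune_xid != InvalidTransactionId;
}
```

With the hint set by plain inserts, a page that has only ever seen inserts is no longer skipped by the early exit, which is what gives on-access pruning the chance to set the VM.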