public inbox for [email protected]  
help / color / mirror / Atom feed
From: Melanie Plageman <[email protected]>
To: Andres Freund <[email protected]>
Cc: Kirill Reshke <[email protected]>
Cc: Chao Li <[email protected]>
Cc: Andrey Borodin <[email protected]>
Cc: Xuneng Zhou <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Heikki Linnakangas <[email protected]>
Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Date: Thu, 19 Mar 2026 22:38:47 -0400
Message-ID: <CAAKRu_bsHbvt+VqbjHXsdphKf8fqwBEutRhH3fmo+qUVe=yBKw@mail.gmail.com> (raw)
In-Reply-To: <jlsvov4o5xswxjvjhvuchz6y55ncvoc457grvxct7cub5gcxuj@4e2toesujnpr>
References: <CAAKRu_a7ByG2LipFADR7UYnLP4BWQKOV1sajqJC4=R37iO05+A@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAAKRu_a+hO4PCptyaPR7AMZd7FjcHfOFKKJT8ouU3KedMud0tQ@mail.gmail.com>
	<CAAKRu_Z8Ry_ynNBPAzs_Ry3MQi9NaBgt1ccLgwRsDbxWpocaBg@mail.gmail.com>
	<CAAKRu_ZbOp52rnkjf63h5mf94raEKBH7AAbz6QTx-xdH9yLfmQ@mail.gmail.com>
	<CAAKRu_b8m+iuupm4ZX+2_V5Xj5u4jCTrU=Tv=epg6p4H2SMkFQ@mail.gmail.com>
	<CALdSSPh9hVXNiPwdntWqbMzu5upKy0jBDDe7Un0_Nf2A54R2VQ@mail.gmail.com>
	<CAAKRu_a6Cd7JnxhY4A=b_Paxc3UDUDOPeqV3GbzMh=R2KkD-uQ@mail.gmail.com>
	<jlsvov4o5xswxjvjhvuchz6y55ncvoc457grvxct7cub5gcxuj@4e2toesujnpr>

Thanks for the detailed review! Unless otherwise specified, attached
v41 includes all of your straightforward review points.

On Wed, Mar 18, 2026 at 1:14 PM Andres Freund <[email protected]> wrote:
>
> > +                     params.relation = relation;
> > +                     params.buffer = buffer;
> > +                     params.vmbuffer = *vmbuffer;
> > +                     params.reason = PRUNE_ON_ACCESS;
> > +                     params.vistest = vistest;
> > +                     params.cutoffs = NULL;
> >
>
> >                        * We don't pass the HEAP_PAGE_PRUNE_MARK_UNUSED_NOW option
> > @@ -284,14 +312,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
> >                        * cannot safely determine that during on-access pruning with the
> >                        * current implementation.
> >                        */
> > -                     PruneFreezeParams params = {
> > -                             .relation = relation,
> > -                             .buffer = buffer,
> > -                             .reason = PRUNE_ON_ACCESS,
> > -                             .options = 0,
> > -                             .vistest = vistest,
> > -                             .cutoffs = NULL,
> > -                     };
> > +                     params.options = 0;
> >
> >                       heap_page_prune_and_freeze(&params, &presult, &dummy_off_loc,
> >                                                                          NULL, NULL);
>
> Why does this change the way the PruneFreezeParams variable is defined?  I
> don't really mind, it's just a bit confusing.

I couldn't use the designated initializer after visibilitymap_pin()
and I thought it was worse to have the designated initializer
nitialize vmbuffer to InvalidBuffer and then have to set vmbuffer to
the real vmbuffer  after visibilitymap_pin().

> > + * Helper to fix visibility-related corruption on a heap page and its
> > + * corresponding VM page. An all-visible page cannot have dead items nor can
> > + * it have tuples that are not visible to all running transactions. It clears
> > + * the VM corruption as well as resetting the vmbits used during pruning.
>
> So this is now only called when we already know there's corruption?  I think
> that could be clearer in the comments.
>
> Seems a bit odd that the function then figures out what it should do from the
> page & VM contents, given that the caller already needs to have known that
> something is wrong?

Yea, it was all a bit off. I agree. I've tried something new and made
a VMCorruptionType enum for the caller to pass in which tells this
function what to do (clear PD_ALL_VISIBLE and/or clear VM) and what
warning to emit.

> > +static void
> > +heap_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum)
> > +{
> > +             {
> > +                     ereport(WARNING,
> > +                                     (errcode(ERRCODE_DATA_CORRUPTED),
> > +                                      errmsg("tuple not visible to all transactions found on page marked all-visible"),
> > +                                      errcontext("relation \"%s\", page %u, tuple %u",
> > +                                                             relname, prstate->block, offnum)));
> > +             }
>
> Wait, why are we now WARNING about the PageIsAllVisible() &&
> prstate->lpdead_items == 0 case? Seems to run flatly counter to the comment
> above about GetOldestNonRemovableTransactionId() going backward?

Only if the page has tuples with HTSV_Result
HEAPTUPLE_RECENTLY_DEAD/DELETE_IN_PROGRESS/INSERT_IN_PROGRESS. Even if
GetOldestNonRemovableTransactionId() goes backwards that should only
make it so that xids we previously thought were visible now show as
not visible to all. But those have to be HEAPTUPLE_LIVE tuples. We
should never thought it was all-visible if there were in-progress
deletes/inserts. So, I think it is okay. Now (in v41), the caller
would need to pass VM_CORRUPT_TUPLE_VISIBILITY and intend to emit the
warning.

> > +     else if (prstate->vmbits & VISIBILITYMAP_VALID_BITS)
> > +     {
> > +             /*
> > +              * As of PostgreSQL 9.2, the visibility map bit should never be set if
> > +              * the page-level bit is clear. However, for vacuum, it's possible
> > +              * that the bit got cleared after heap_vac_scan_next_block() was
> > +              * called, so we must recheck now that we have the buffer lock before
> > +              * concluding that the VM is corrupt.
> > +              */
> > +             ereport(WARNING,
> > +                             (errcode(ERRCODE_DATA_CORRUPTED),
> > +                              errmsg("page is not marked all-visible but visibility map bit is set"),
> > +                              errcontext("relation \"%s\", page %u",
> > +                                                     relname, prstate->block)));
> > +     }
> > +
> > +     visibilitymap_clear(prstate->relation, prstate->block, prstate->vmbuffer,
> > +                                             VISIBILITYMAP_VALID_BITS);
> > +     prstate->vmbits = 0;
>
> So we can end up clearing the VM without emitting any warning?

This was me trying to avoid duplicating code in the branches. In v41,
I error out if the caller doesn't specify a valid corruption type, so
anything that clears the VM will have emitted a warning.

> > +static void
> > +heap_page_bypass_prune_freeze(PruneState *prstate, PruneFreezeResult *presult)
> > +{
> > +     OffsetNumber maxoff = PageGetMaxOffsetNumber(prstate->page);
> > +     Page            page = prstate->page;
> > +
> > +     Assert(prstate->vmbits & VISIBILITYMAP_ALL_FROZEN ||
> > +                (prstate->vmbits & VISIBILITYMAP_ALL_VISIBLE &&
> > +                     !prstate->attempt_freeze));
> > +
> > +     /* We'll fill in presult for the caller */
> > +     memset(presult, 0, sizeof(PruneFreezeResult));
> > +
> > +     presult->vmbits = prstate->vmbits;
> > +
> > +     /* Clear any stale prune hint */
> > +     if (TransactionIdIsValid(PageGetPruneXid(page)))
> > +     {
> > +             PageClearPrunable(page);
> > +             MarkBufferDirtyHint(prstate->buffer, true);
> > +     }
> > +
> > +     if (PageIsEmpty(page))
> > +             return;
> > +
> > +     presult->hastup = true;
>
> Is that actually a given? Couldn't the page consist solely out of unused
> items? That'd make PageIsEmpty() return false, but should still allow
> truncation.

Good point. I've changed it to set hastup when counting live tuples.

But should I set hastup it if I see an LP_REDIRECT pointer? I know I
should always see a LP_NORMAL pointer if I see an LP_REDIRECT pointer,
but I just wondered if I should explicitly set hastup when I see
LP_REDIRECT since heap_prune_record_redirect() sets hastup = true.

> > diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
>
> > +     /*
> > +      * After processing all the live tuples on the page, if the newest xmin
> > +      * amongst them may be considered running by any snapshot, the page cannot
> > +      * be all-visible.
> > +      */
> > +     if (prstate.set_all_visible &&
> > +             TransactionIdIsNormal(prstate.visibility_cutoff_xid) &&
> > +             GlobalVisTestXidMaybeRunning(prstate.vistest,
> > +                                                                      prstate.visibility_cutoff_xid))
> > +             prstate.set_all_visible = prstate.set_all_frozen = false;
> > +
>
> So the docs for prstate.visibility_cutoff_xid say:
>
>          * visibility_cutoff_xid is the newest xmin of live tuples on the page.
>          * The caller can use it as the conflict horizon, when setting the VM
>          * bits.  It is only valid if we froze some tuples, and set_all_frozen is
>          * true.
>
> But here we look at it without checking that we froze some tuples.  I guess
> the comment is outdated?

That comment was never correct -- or I have chopped it into
unrecognizable bits over the last two years.

> Could the "going backward" thing possibly trigger a spurious assert in
>
>         Assert(heap_page_is_all_visible(vacrel->rel, buf,
>                                         vacrel->vistest, &debug_all_frozen,
>                                         &debug_cutoff, &vacrel->offnum));

I don't think anything (today) updates GlobalVisState between
GlobalVisTestXidMaybeRunning() and the heap_page_is_all_visible()
assert.

I had removed the visibility_cutoff_xid part of the assertion on the
intuition that comparing an exact horizon would no longer work when
using GlobalVisState. I can't remember if I actually saw failing
tests, but I don't see them anymore (so I've put it back).

The heap_page_is_all_visible() assertion moves into
heap_page_prune_and_freeze() in a later patch in this set, and while
it is also in a place where I don't think GlobalVisState can have
moved between making the page changes and calling
heap_page_is_all_visible(), I suspect it won't be a totally reliable
assertion now that it uses a moving target for comparison. What do you
think?

> > From a1d768a8cea8ac13e250188ec96c01d98acda94a Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <[email protected]>
> > Date: Sat, 28 Feb 2026 16:06:51 -0500
> > Subject: [PATCH v40 04/12] Keep newest live XID up-to-date even if page not
> >  all-visible
>
> I guess I'd have expected 03 and 04 to be swapped... But whatever.

It couldn't be because I used GlobalVisState to always keep it
up-to-date (even for on-access pruning).

> > @@ -1076,6 +1116,30 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
> >               prstate.set_all_visible = prstate.set_all_frozen = false;
> >
> >       Assert(!prstate.set_all_frozen || prstate.set_all_visible);
> > +     Assert(!prstate.set_all_visible || (prstate.lpdead_items == 0));
>
> Why didn't we have this assert earlier?

It was in lazy_scan_prune() as:
    Assert(!presult.set_all_visible || !(*has_lpdead_items));

> > +     do_set_vm = heap_page_will_set_vm(&prstate, params->reason);
>
> Most of the other heap_page_prune_and_freeze() helpers are named
> heap_prune_xyz(), why not follow that here?
>
> I guess this holds for a few other helpers added in earlier commits
> too. E.g. heap_page_bypass_prune_freeze() should probably be
> heap_prune_bypass_prune_freeze() or such.

Most of the helpers prefixed with "heap_prune" now directly do
something related to pruning like recording line pointers and
traversing hot chains. heap_page_will_set_vm() and
heap_page_will_freeze() have nothing to do with pruning, so I think it
makes sense they are named differently.

And I don't think we are in any danger of folks using functions not
prefixed with heap_prune for other purposes, given that most of them
take a PruneState as an argument.

I'll do a big rename if you feel strongly about it, though.

> > @@ -1097,14 +1161,17 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
> >
> >               /*
> >                * If that's all we had to do to the page, this is a non-WAL-logged
> > -              * hint.  If we are going to freeze or prune the page, we will mark
> > -              * the buffer dirty below.
> > +              * hint.  If we are going to freeze or prune the page or set
> > +              * PD_ALL_VISIBLE, we will mark the buffer dirty below.
> > +              *
> > +              * Setting PD_ALL_VISIBLE is fully WAL-logged because it is forbidden
> > +              * for the VM to be set and PD_ALL_VISIBLE to be clear.
> >                */
> > -             if (!do_freeze && !do_prune)
> > +             if (!do_freeze && !do_prune && !do_set_vm)
> >                       MarkBufferDirtyHint(prstate.buffer, true);
> >       }
>
> This block is gated by if (do_hint_prune) which is computed as:
>
>         /*
>          * Even if we don't prune anything, if we found a new value for the
>          * pd_prune_xid field or the page was marked full, we will update the hint
>          * bit.
>          */
>         do_hint_prune = PageGetPruneXid(prstate.page) != prstate.new_prune_xid ||
>                 PageIsFull(prstate.page);
>
> It's not really related to this change, but I'm just confused a bit by the
> "|| PageIsFull(prstate.page)". What is that about? Why do we want to mark the
> buffer DirtyHint if the page is full? It very well might already have been
> marked as such, no?

Because if the page is marked full, we clear that hint, and, if that's
the only change we make to the page, we need to do
MarkBufferDirtyHint().

> > +#ifdef USE_ASSERT_CHECKING
> > +     if (prstate.set_all_visible)
> > +     {
> > +             TransactionId debug_cutoff;
> > +             bool            debug_all_frozen;
> > +
> > +             Assert(prstate.lpdead_items == 0);
> > +
> > +             Assert(heap_page_is_all_visible(prstate.relation, prstate.buffer,
> > +                                                                             prstate.vistest,
> > +                                                                             &debug_all_frozen,
> > +                                                                             &debug_cutoff, off_loc));
> > +
> > +             /*
> > +              * It's possible the page is composed entirely of frozen tuples but is
> > +              * not set all-frozen in the VM and did not pass
> > +              * HEAP_PAGE_PRUNE_FREEZE. In this case, it's possible
> > +              * heap_page_is_all_visible() finds the page completely frozen, even
> > +              * though prstate.set_all_frozen is false.
> > +              */
> > +             Assert(!prstate.set_all_frozen || debug_all_frozen);
>
> Seems like we could verify that debug_cutoff isn't newer than conflict_xid?

Well, not conflict_xid, but newest_xid, yes.

> Hm.  I guess aborting after we did incorrect pruning/freezing/VMing is better
> than not, but it'd be even better if we did it before corrupting things. But I
> guess it'd be not trivial to add something like the debug_cutoff assertion I
> suggest above, when freezing of tuples is only executed after
> heap_page_is_all_visible() (for dead tuples heap_page_would_be_all_visible()
> already has provisions).
>
> It's probably more a theoretical concern than a real worry.

Yea, I think the work it would take to make
heap_page_would_be_all_visible() work for frozen tuples wouldn't be
worth it just to get it to assert out before executing the page
changes.

> > +     presult->new_all_visible_pages = 0;
> > +     presult->new_all_frozen_pages = 0;
> > +     presult->new_all_visible_frozen_pages = 0;
>
> Isn't it odd to talk about pages here? Given that heap_page_prune_and_freeze()
> only ever operates on exactly one page.  Is that just so you can do
>
> > +     vacrel->new_all_visible_pages += presult.new_all_visible_pages;

I made this change because you didn't like it when I passed old_vmbits
and new_vmbits back out to lazy_scan_prune() to derive these counters.
FWIW I think it's better not to have lazy_scan_prune() compare new and
old vmbits to increment counters, because lazy_scan_prune() shouldn't
have to know about the VM anymore once it is not setting it.

> > +     if (do_set_vm)
> > +     {
> > +             if ((prstate.old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> > +             {
> > +                     presult->new_all_visible_pages = 1;
> > +                     if (prstate.set_all_frozen)
> > +                             presult->new_all_visible_frozen_pages = 1;
> > +             }
> > +             else if ((prstate.old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
> > +                              prstate.set_all_frozen)
> > +                     presult->new_all_frozen_pages = 1;
> > +     }
> > +
> >       if (prstate.attempt_freeze)
> >       {
> >               if (presult->nfrozen > 0)
>
> Feels like this is kinda redoing what heap_page_will_set_vm already did.

The logic is different than what is in heap_page_will_set_vm() because
there we don't care about what old_vmbits is. We are simply concerned
with whether we should set new_vmbits to something.

So we need to have logic somewhere that is figuring out if the vmbits
were set before and whether we newly set them. That can either go in
heap_page_prune_and_freeze() and we can use that to set the counters
in the LVRelState or it can go in lazy_scan_prune().

I think it makes more sense in heap_page_prune_and_freeze() so that
lazy_scan_prune() doesn't have to know about the VM's new/old state,
which it otherwise no longer deals with.


> > @@ -1923,13 +1926,33 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
> >                       PageSetAllVisible(page);
> >                       PageClearPrunable(page);
> > -                     visibilitymap_set(vacrel->rel, blkno, buf,
> > -                                                       InvalidXLogRecPtr,
> > -                                                       vmbuffer, InvalidTransactionId,
> > -                                                       VISIBILITYMAP_ALL_VISIBLE |
> > -                                                       VISIBILITYMAP_ALL_FROZEN);
> > +                     visibilitymap_set_vmbits(blkno,
> > +                                                                      vmbuffer,
> > +                                                                      VISIBILITYMAP_ALL_VISIBLE |
> > +                                                                      VISIBILITYMAP_ALL_FROZEN,
> > +                                                                      vacrel->rel->rd_locator);
> > +
> > +                     /*
> > +                      * Emit WAL for setting PD_ALL_VISIBLE on the heap page and
> > +                      * setting the VM.
> > +                      */
> > +                     if (RelationNeedsWAL(vacrel->rel))
> > +                             log_heap_prune_and_freeze(vacrel->rel, buf,
> > +                                                                               vmbuffer,
> > +                                                                               VISIBILITYMAP_ALL_VISIBLE |
> > +                                                                               VISIBILITYMAP_ALL_FROZEN,
> > +                                                                               InvalidTransactionId, /* conflict xid */
> > +                                                                               false,        /* cleanup lock */
> > +                                                                               PRUNE_VACUUM_SCAN,    /* reason */
> > +                                                                               NULL, 0,
> > +                                                                               NULL, 0,
> > +                                                                               NULL, 0,
> > +                                                                               NULL, 0);
> > +
> >                       END_CRIT_SECTION();
>
> It's a tad odd that we do:
>
>                         /*
>                          * It's possible that another backend has extended the heap,
>                          * initialized the page, and then failed to WAL-log the page due
>                          * to an ERROR.  Since heap extension is not WAL-logged, recovery
>                          * might try to replay our record setting the page all-visible and
>                          * find that the page isn't initialized, which will cause a PANIC.
>                          * To prevent that, check whether the page has been previously
>                          * WAL-logged, and if not, do that now.
>                          */
>                         if (RelationNeedsWAL(vacrel->rel) &&
>                                 !XLogRecPtrIsValid(PageGetLSN(page)))
>                                 log_newpage_buffer(buf, true);
>
> if we then immediately afterwards emit a WAL record that could just as well
> have included in FPI of the heap page.

I originally added a flag to log_heap_prune_and_freeze() that could
force an FPI but Robert disliked it, saying he found it more
confusing. He said:

> 0004. It is not clear to me why you need to get
> log_heap_prune_and_freeze to do the work here. Why can't
> log_newpage_buffer get the job done already?

I can put it back that way, I don't have strong feelings either way.
Though I imagine if I add another argument to
log_heap_prune_and_freeze(), you'll bring up creating a struct for its
arguments again...


> > diff --git a/src/backend/access/common/bufmask.c b/src/backend/access/common/bufmask.c
> > index 8a67bfa1aff..d9042e1f91d 100644
> > --- a/src/backend/access/common/bufmask.c
> > +++ b/src/backend/access/common/bufmask.c
> > @@ -56,8 +56,8 @@ mask_page_hint_bits(Page page)
> >
> >       /*
> >        * During replay, if the page LSN has advanced past our XLOG record's LSN,
> > -      * we don't mark the page all-visible. See heap_xlog_visible() for
> > -      * details.
> > +      * we don't mark the page all-visible. See heap_xlog_prune_freeze() for
> > +      * more details.
> >        */
> >       PageClearAllVisible(page);
> >  }
>
> Not introduced by your change, but isn't it rather terrifying that the
> wal_consistency_checking infrastructure doesn't verify whether the page is
> marked all-visible? Wasn't aware of this. Seems bonkers to me.

Agreed. I wonder what it would take to start.

> I don't even know what specifically in heap_xlog_visible() that comment is
> referring to? Just that we only do PageSetAllVisible() if BLK_NEEDS_REDO? But
> uh, what does that have to do with anything?

Yea, this comment doesn't make sense. I think we should remove it.

But regarding why we mask PD_ALL_VISIBLE in wal consistency checking,
I wonder if this is the scenario:

Record 1 sets the VM and PD_ALL_VISIBLE
Record 2 inserts a tuple and clears PD_ALL_VISIBLE
the heap page is flushed to disk, but the VM page is not
crash
replay R1; skip setting PD_ALL_VISIBLE because the page has R2's LSN;
set the bits on the VM page

Even though we'll clear the VM when we replay R2, if we cross-check
the page and VM after replaying only R1, the VM will be set and
PD_ALL_VISIBLE will be clear. I think this is okay because no one
should see them at this time. But it might not work with wal
consistency checking.

> > @@ -477,12 +477,12 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
> >        * If we get passed InvalidTransactionId then we do nothing (no conflict).
> >        *
> >        * This can happen when replaying already-applied WAL records after a
> > -      * standby crash or restart, or when replaying an XLOG_HEAP2_VISIBLE
> > -      * record that marks as frozen a page which was already all-visible.  It's
> > -      * also quite common with records generated during index deletion
> > -      * (original execution of the deletion can reason that a recovery conflict
> > -      * which is sufficient for the deletion operation must take place before
> > -      * replay of the deletion record itself).
> > +      * standby crash or restart
>
> Again not about your patch: I don't understand how already applied WAL can
> lead to InvalidTransactionId being passed here. The record doesn't change just
> because we had already applied the WAL?

Yea, I think the comment is just wrong. I realized the comment still
needed to reference my code, so I've updated it.

> > From 04b03c1ec3abcee75e464fef994b482df41b35f4 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <[email protected]>
> > Date: Wed, 3 Dec 2025 15:07:24 -0500
> > Subject: [PATCH v40 08/12] Track which relations are modified by a query
> >
> > Save the relids of modified relations in a bitmap in the executor state.
> > A later commit will pass this information down to scan nodes to control
> > whether or not on-access pruning is allowed to set the visibility map.
> > Setting the visibility map during a scan is counterproductive if the
> > query is going to modify the page immediately after.
> >
> > Relations are considered modified if they are the target of INSERT,
> > UPDATE, DELETE, or MERGE, or if they have any row mark (including SELECT
> > FOR UPDATE/SHARE). All row mark types are included, even those which
> > don't actually modify tuples, because this bitmap is only used as a hint
> > to avoid unnecessary work.
>
> You're probably going to hate me for the question, but is there a reason to
> not compute es_modified_relids at plan time?

Yea, it probably does make more sense there. The only thing is that by
doing it in planner, it could include relids of leaf partitions that
get run-time pruned. But we won't scan those, so it is no issue for
this feature. I'm just wondering if it dilutes the meaning of
"modified relids", though.

In v41, I've implemented it in planner (which also made me realize
parallel workers previously didn't have es_modified_relids, oops).

> > @@ -992,6 +996,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
> >        */
> >       planstate = ExecInitNode(plan, estate, eflags);
> >
> > +#ifdef USE_ASSERT_CHECKING
> > +     CrossCheckModifiedRelids(estate);
> > +#endif
>
> Not sure that buys you much, given it pretty much is just a restatement of the
> code building estate->es_modified_relids.

Yea, now that I've done it in planner, I cross-check in the executor.

> What about checking against PlannedStmt->{resultRelations, permInfos} or

I don't think it makes sense to use permInfos because according to
expand_single_inheritance_child() there is no permission checking for
child RTEs, so I think permInfos won't include everything we need.

> asserting membership at the places that actually lock/modify?

Are you thinking I should also add some in ExecInsert, ExecDelete,
ExecUpdate, and ExecLockRows? Think this might be redundant with the
executor cross-check I have now after InitPlan(). (I've done it anyway
so we can discuss).

> > diff --git a/src/include/access/genam.h b/src/include/access/genam.h
> > index 1a27bf060b3..db102803eb5 100644
> > --- a/src/include/access/genam.h
> > +++ b/src/include/access/genam.h
> > @@ -158,7 +158,7 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
> >                                                                        Relation indexRelation,
> >                                                                        Snapshot snapshot,
> >                                                                        IndexScanInstrumentation *instrument,
> > -                                                                      int nkeys, int norderbys);
> > +                                                                      int nkeys, int norderbys, uint32 flags);
>
> I'd probably put flags in a position where it's not as easily confused with
> nkeys or norderbys.

Do you mean like move it before nkeys and norderbys or move it
earlier? I did the latter but not sure if it's weird to have flags
before snapshot (especially since the other table am routines pass it
last). I think it looks kind of weird when all of the other ones have
flags as the last argument.

> > @@ -1059,10 +1062,11 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
> >  static inline TableScanDesc
> >  table_beginscan_tidrange(Relation rel, Snapshot snapshot,
> >                                                ItemPointer mintid,
> > -                                              ItemPointer maxtid)
> > +                                              ItemPointer maxtid, uint32 flags)
> >  {
> >       TableScanDesc sscan;
> > -     uint32          flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
> > +
> > +     flags |= SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
> >
> >       sscan = table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags);
>
> Hm. Would it perhaps be a good idea to have an assert as to which flags are
> specified by the "user"? If e.g. another SO_TYPE_* were specified it might
> result in some odd behaviour.
>
> Perhaps this would be best done by adding an argument to
> table_beginscan_common() specifying the "internal" flags (i.e. the ones that
> specified inside table_beginscan_*) and user specified flags?  Then
> table_beginscan_common could check the set of user specified flags being sane.

Yes, good idea. Done in attached v41.

It's unclear to me which flags should be considered internal though. I
think it makes sense that the SO_TYPE* flags are considered internal
because you can only specify one.  But all of the other current
ScanOptions are specified inside table_beginscan_* so do you mean that
we should consider all of those internal flags?


> > +      * Some optimizations can only be performed if the query does not modify
> > +      * the underlying relation. Track that here.
> > +      */
> > +     bool            modifies_base_rel;
> >  } IndexFetchHeapData;
>
> Wonder if this should be in the generic IndexFetchTableData?

I added flags to the IndexFetchTableData in much the same way as the
regular table scan descriptor has them.

> > diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
> > index d264a698ff6..a5536ba4ff6 100644
> > --- a/src/test/recovery/t/035_standby_logical_decoding.pl
> > +++ b/src/test/recovery/t/035_standby_logical_decoding.pl
> > @@ -296,6 +296,7 @@ wal_level = 'logical'
> >  max_replication_slots = 4
> >  max_wal_senders = 4
> >  autovacuum = off
> > +hot_standby_feedback = on
> >  });
> >  $node_primary->dump_info;
> >  $node_primary->start;
> > @@ -748,7 +749,7 @@ check_pg_recvlogical_stderr($handle,
> >  $logstart = -s $node_standby->logfile;
> >
> >  reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
> > -     'no_conflict_', 0, 1);
> > +     'no_conflict_', 1, 0);
> >
> >  # This should not trigger a conflict
> >  wait_until_vacuum_can_remove(
> > --
> > 2.43.0
>
> Why does this patch need to change anything here? Is the test buggy
> independently?

Nope. I guess that was a mistake during development. No change needed.

> > @@ -1863,16 +1864,14 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
> >                       prstate->set_all_visible = false;
> >                       prstate->set_all_frozen = false;
> >
> > -                     /* The page should not be marked all-visible */
> > -                     if (PageIsAllVisible(page))
> > -                             heap_fix_vm_corruption(prstate, offnum);
> > -
>
> Huh?

heap_prune_record_prunable() already does the corruption check, so I
don't need to do it separately for INSERT_IN_PROGRESS tuples once we
call heap_prune_record_prunable() for them.

- Melanie


Attachments:

  [text/x-patch] v41-0001-Fix-visibility-map-corruption-in-more-cases.patch (20.5K, 2-v41-0001-Fix-visibility-map-corruption-in-more-cases.patch)
  download | inline diff:
From c87dd03ec309e12247c8ccdc3adf289a1e451255 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 25 Feb 2026 16:23:09 -0500
Subject: [PATCH v41 01/12] Fix visibility map corruption in more cases

Move VM corruption detection and repair into pruning. This allows VM
repair during on-access pruning, not only during vacuum.

Also, expand corruption detection to cover pages marked all-visible that
contain dead tuples and tuples inserted or updated by in-progress
transactions, rather than only all-visible pages with LP_DEAD items.

Pinning the correct VM page before on-access pruning is cheap when
compared to the cost of actually pruning. The vmbuffer is saved in the
scan descriptor, so a query should only need to pin each VM page once
and a single VM page covers a large number of heap pages.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
---
 src/backend/access/heap/pruneheap.c  | 215 +++++++++++++++++++++++++--
 src/backend/access/heap/vacuumlazy.c |  89 +----------
 src/include/access/heapam.h          |  12 ++
 src/tools/pgindent/typedefs.list     |   1 +
 4 files changed, 215 insertions(+), 102 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8d9f0694206..e452d25cae6 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -19,7 +19,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/transam.h"
-#include "access/visibilitymapdefs.h"
+#include "access/visibilitymap.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
@@ -114,6 +114,21 @@ typedef struct
 	 */
 	HeapPageFreeze pagefrz;
 
+	/*-------------------------------------------------------
+	 * Working state for visibility map processing
+	 *-------------------------------------------------------
+	 */
+
+	/*
+	 * Caller must provide a pinned vmbuffer corresponding to the heap block
+	 * passed to heap_page_prune_and_freeze(). We will fix any corruption
+	 * found in the VM.
+	 */
+	Buffer		vmbuffer;
+
+	/* Bits in the vmbuffer for this heap page */
+	uint8		old_vmbits;
+
 	/*-------------------------------------------------------
 	 * Information about what was done
 	 *
@@ -162,12 +177,30 @@ typedef struct
 	TransactionId visibility_cutoff_xid;
 } PruneState;
 
+
+/*
+ * Type of visibility map corruption detected on a heap page.  Passed to
+ * heap_page_fix_vm_corruption() so the caller can specify what it found rather
+ * than having the function re-derive the corruption from page state.
+ */
+typedef enum VMCorruptionType
+{
+	/* VM bits are set but the page-level PD_ALL_VISIBLE flag is not */
+	VM_CORRUPT_MISSING_PAGE_HINT,
+	/* LP_DEAD line pointers found on a page marked all-visible */
+	VM_CORRUPT_LPDEAD,
+	/* Tuple not visible to all transactions on a page marked all-visible */
+	VM_CORRUPT_TUPLE_VISIBILITY,
+} VMCorruptionType;
+
 /* Local functions */
 static void prune_freeze_setup(PruneFreezeParams *params,
 							   TransactionId *new_relfrozen_xid,
 							   MultiXactId *new_relmin_mxid,
 							   PruneFreezeResult *presult,
 							   PruneState *prstate);
+static void heap_page_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum,
+										VMCorruptionType ctype);
 static void prune_freeze_plan(PruneState *prstate,
 							  OffsetNumber *off_loc);
 static HTSV_Result heap_prune_satisfies_vacuum(PruneState *prstate,
@@ -175,7 +208,8 @@ static HTSV_Result heap_prune_satisfies_vacuum(PruneState *prstate,
 static inline HTSV_Result htsv_get_valid_status(int status);
 static void heap_prune_chain(OffsetNumber maxoff,
 							 OffsetNumber rootoffnum, PruneState *prstate);
-static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid);
+static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid,
+									   OffsetNumber offnum);
 static void heap_prune_record_redirect(PruneState *prstate,
 									   OffsetNumber offnum, OffsetNumber rdoffnum,
 									   bool was_normal);
@@ -209,8 +243,9 @@ static bool heap_page_will_freeze(bool did_tuple_hint_fpi, bool do_prune, bool d
  * Caller must have pin on the buffer, and must *not* have a lock on it.
  *
  * This function may pin *vmbuffer. It's passed by reference so the caller can
- * reuse the pin across calls, avoiding repeated pin/unpin cycles. Caller is
- * responsible for unpinning it.
+ * reuse the pin across calls, avoiding repeated pin/unpin cycles. If we find
+ * VM corruption during pruning, we will fix it. Caller is responsible for
+ * unpinning *vmbuffer.
  */
 void
 heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
@@ -277,6 +312,16 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
 		{
 			OffsetNumber dummy_off_loc;
 			PruneFreezeResult presult;
+			PruneFreezeParams params;
+
+			visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer);
+
+			params.relation = relation;
+			params.buffer = buffer;
+			params.vmbuffer = *vmbuffer;
+			params.reason = PRUNE_ON_ACCESS;
+			params.vistest = vistest;
+			params.cutoffs = NULL;
 
 			/*
 			 * We don't pass the HEAP_PAGE_PRUNE_MARK_UNUSED_NOW option
@@ -284,14 +329,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
 			 * cannot safely determine that during on-access pruning with the
 			 * current implementation.
 			 */
-			PruneFreezeParams params = {
-				.relation = relation,
-				.buffer = buffer,
-				.reason = PRUNE_ON_ACCESS,
-				.options = 0,
-				.vistest = vistest,
-				.cutoffs = NULL,
-			};
+			params.options = 0;
 
 			heap_page_prune_and_freeze(&params, &presult, &dummy_off_loc,
 									   NULL, NULL);
@@ -354,6 +392,12 @@ prune_freeze_setup(PruneFreezeParams *params,
 	prstate->buffer = params->buffer;
 	prstate->page = BufferGetPage(params->buffer);
 
+	Assert(BufferIsValid(params->vmbuffer));
+	prstate->vmbuffer = params->vmbuffer;
+	prstate->old_vmbits = visibilitymap_get_status(prstate->relation,
+												   prstate->block,
+												   &prstate->vmbuffer);
+
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
 	 * then apply the changes within a critical section.  This keeps as much
@@ -770,6 +814,104 @@ heap_page_will_freeze(bool did_tuple_hint_fpi,
 	return do_freeze;
 }
 
+/*
+ * Emit a warning about and fix visibility map corruption on the given page.
+ *
+ * The caller specifies the type of corruption it has already detected via
+ * corruption_type, so that we can emit the appropriate warning. All cases
+ * result in the VM bits being cleared; page-level corruption types also clear
+ * PD_ALL_VISIBLE.
+ *
+ * Must be called while holding an exclusive lock on the heap buffer. Dead
+ * items must have been discovered under that same lock. Although we do not
+ * hold a lock on the VM buffer, it is pinned, and the heap buffer is
+ * exclusively locked, ensuring that no other backend can update the VM bits
+ * corresponding to this heap page.
+ *
+ * This function makes changes to the VM and, potentially, the heap page, but
+ * it does not need to be done in a critical section.
+ */
+static void
+heap_page_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum,
+							VMCorruptionType corruption_type)
+{
+	const char *relname = RelationGetRelationName(prstate->relation);
+
+	Assert(BufferIsLockedByMeInMode(prstate->buffer, BUFFER_LOCK_EXCLUSIVE));
+
+	switch (corruption_type)
+	{
+		case VM_CORRUPT_LPDEAD:
+			ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("dead line pointer found on page marked all-visible"),
+					 errcontext("relation \"%s\", page %u, tuple %u",
+								relname, prstate->block, offnum)));
+			break;
+
+		case VM_CORRUPT_TUPLE_VISIBILITY:
+
+			/*
+			 * A HEAPTUPLE_LIVE tuple on an all-visible page can appear to not
+			 * be visible to everyone when
+			 * GetOldestNonRemovableTransactionId() returns a conservative
+			 * value that's older than the real safe xmin. That is not
+			 * corruption -- the PD_ALL_VISIBLE flag is still correct.
+			 *
+			 * However, dead tuple versions, in-progress inserts, and
+			 * in-progress deletes should never appear on a page marked
+			 * all-visible. That indicates real corruption. PD_ALL_VISIBLE
+			 * should have been cleared by the DML operation that deleted or
+			 * inserted the tuple.
+			 */
+			ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("tuple not visible to all transactions found on page marked all-visible"),
+					 errcontext("relation \"%s\", page %u, tuple %u",
+								relname, prstate->block, offnum)));
+			break;
+
+		case VM_CORRUPT_MISSING_PAGE_HINT:
+
+			/*
+			 * As of PostgreSQL 9.2, the visibility map bit should never be
+			 * set if the page-level bit is clear. However, for vacuum, it's
+			 * possible that the bit got cleared after
+			 * heap_vac_scan_next_block() was called, so we must recheck now
+			 * that we have the buffer lock before concluding that the VM is
+			 * corrupt.
+			 */
+			Assert(!PageIsAllVisible(prstate->page));
+			Assert(prstate->old_vmbits & VISIBILITYMAP_VALID_BITS);
+			ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("page is not marked all-visible but visibility map bit is set"),
+					 errcontext("relation \"%s\", page %u",
+								relname, prstate->block)));
+			break;
+
+		default:
+			elog(ERROR, "unrecognized VM corruption type: %d",
+				 (int) corruption_type);
+			break;
+	}
+
+	/*
+	 * Clear PD_ALL_VISIBLE on the heap page if it is set.
+	 * VM_CORRUPT_MISSING_PAGE_HINT is already clear by definition, so avoid
+	 * marking the buffer dirty.
+	 */
+	if (corruption_type != VM_CORRUPT_MISSING_PAGE_HINT)
+	{
+		Assert(PageIsAllVisible(prstate->page));
+		PageClearAllVisible(prstate->page);
+		MarkBufferDirtyHint(prstate->buffer, true);
+	}
+
+	visibilitymap_clear(prstate->relation, prstate->block, prstate->vmbuffer,
+						VISIBILITYMAP_VALID_BITS);
+	prstate->old_vmbits = 0;
+}
 
 /*
  * Prune and repair fragmentation and potentially freeze tuples on the
@@ -830,6 +972,16 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 					   new_relfrozen_xid, new_relmin_mxid,
 					   presult, &prstate);
 
+	/*
+	 * If the VM is set but PD_ALL_VISIBLE is clear, fix that corruption
+	 * before pruning and freezing so that the page and VM start out in a
+	 * consistent state.
+	 */
+	if ((prstate.old_vmbits & VISIBILITYMAP_VALID_BITS) &&
+		!PageIsAllVisible(prstate.page))
+		heap_page_fix_vm_corruption(&prstate, InvalidOffsetNumber,
+									VM_CORRUPT_MISSING_PAGE_HINT);
+
 	/*
 	 * Examine all line pointers and tuple visibility information to determine
 	 * which line pointers should change state and which tuples may be frozen.
@@ -973,6 +1125,7 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 	presult->set_all_visible = prstate.set_all_visible;
 	presult->set_all_frozen = prstate.set_all_frozen;
 	presult->hastup = prstate.hastup;
+	presult->old_vmbits = prstate.old_vmbits;
 
 	/*
 	 * For callers planning to update the visibility map, the conflict horizon
@@ -1295,7 +1448,8 @@ process_chain:
 
 /* Record lowest soon-prunable XID */
 static void
-heap_prune_record_prunable(PruneState *prstate, TransactionId xid)
+heap_prune_record_prunable(PruneState *prstate, TransactionId xid,
+						   OffsetNumber offnum)
 {
 	/*
 	 * This should exactly match the PageSetPrunable macro.  We can't store
@@ -1305,6 +1459,14 @@ heap_prune_record_prunable(PruneState *prstate, TransactionId xid)
 	if (!TransactionIdIsValid(prstate->new_prune_xid) ||
 		TransactionIdPrecedes(xid, prstate->new_prune_xid))
 		prstate->new_prune_xid = xid;
+
+	/*
+	 * It's incorrect for a page to be marked all-visible if it contains
+	 * prunable items.
+	 */
+	if (PageIsAllVisible(prstate->page))
+		heap_page_fix_vm_corruption(prstate, offnum,
+									VM_CORRUPT_TUPLE_VISIBILITY);
 }
 
 /* Record line pointer to be redirected */
@@ -1388,6 +1550,15 @@ heap_prune_record_dead_or_unused(PruneState *prstate, OffsetNumber offnum,
 		heap_prune_record_unused(prstate, offnum, was_normal);
 	else
 		heap_prune_record_dead(prstate, offnum, was_normal);
+
+	/*
+	 * It's incorrect for the page to be set all-visible if it contains dead
+	 * items. Fix that on the heap page and check the VM for corruption as
+	 * well. Do that here rather than in heap_prune_record_dead() so we also
+	 * cover tuples that are directly marked LP_UNUSED via mark_unused_now.
+	 */
+	if (PageIsAllVisible(prstate->page))
+		heap_page_fix_vm_corruption(prstate, offnum, VM_CORRUPT_LPDEAD);
 }
 
 /* Record line pointer to be marked unused */
@@ -1527,7 +1698,8 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
 			 * that the page is reconsidered for pruning in future.
 			 */
 			heap_prune_record_prunable(prstate,
-									   HeapTupleHeaderGetUpdateXid(htup));
+									   HeapTupleHeaderGetUpdateXid(htup),
+									   offnum);
 			break;
 
 		case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -1542,6 +1714,11 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
 			prstate->set_all_visible = false;
 			prstate->set_all_frozen = false;
 
+			/* The page should not be marked all-visible */
+			if (PageIsAllVisible(page))
+				heap_page_fix_vm_corruption(prstate, offnum,
+											VM_CORRUPT_TUPLE_VISIBILITY);
+
 			/*
 			 * If we wanted to optimize for aborts, we might consider marking
 			 * the page prunable when we see INSERT_IN_PROGRESS.  But we
@@ -1566,7 +1743,8 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
 			 * the page is reconsidered for pruning in future.
 			 */
 			heap_prune_record_prunable(prstate,
-									   HeapTupleHeaderGetUpdateXid(htup));
+									   HeapTupleHeaderGetUpdateXid(htup),
+									   offnum);
 			break;
 
 		default:
@@ -1632,6 +1810,13 @@ heap_prune_record_unchanged_lp_dead(PruneState *prstate, OffsetNumber offnum)
 
 	/* Record the dead offset for vacuum */
 	prstate->deadoffsets[prstate->lpdead_items++] = offnum;
+
+	/*
+	 * It's incorrect for a page to be marked all-visible if it contains dead
+	 * items.
+	 */
+	if (PageIsAllVisible(prstate->page))
+		heap_page_fix_vm_corruption(prstate, offnum, VM_CORRUPT_LPDEAD);
 }
 
 /*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c57432670e7..56722556417 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -432,11 +432,6 @@ static void find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
-static void identify_and_fix_vm_corruption(Relation rel, Buffer heap_buffer,
-										   BlockNumber heap_blk, Page heap_page,
-										   int nlpdead_items,
-										   Buffer vmbuffer,
-										   uint8 *vmbits);
 static int	lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							BlockNumber blkno, Page page,
 							Buffer vmbuffer,
@@ -1989,81 +1984,6 @@ cmpOffsetNumbers(const void *a, const void *b)
 	return pg_cmp_u16(*(const OffsetNumber *) a, *(const OffsetNumber *) b);
 }
 
-/*
- * Helper to correct any corruption detected on a heap page and its
- * corresponding visibility map page after pruning but before setting the
- * visibility map. It examines the heap page, the associated VM page, and the
- * number of dead items previously identified.
- *
- * This function must be called while holding an exclusive lock on the heap
- * buffer, and the dead items must have been discovered under that same lock.
-
- * The provided vmbits must reflect the current state of the VM block
- * referenced by vmbuffer. Although we do not hold a lock on the VM buffer, it
- * is pinned, and the heap buffer is exclusively locked, ensuring that no
- * other backend can update the VM bits corresponding to this heap page.
- *
- * If it clears corruption, it will zero out vmbits.
- */
-static void
-identify_and_fix_vm_corruption(Relation rel, Buffer heap_buffer,
-							   BlockNumber heap_blk, Page heap_page,
-							   int nlpdead_items,
-							   Buffer vmbuffer,
-							   uint8 *vmbits)
-{
-	Assert(visibilitymap_get_status(rel, heap_blk, &vmbuffer) == *vmbits);
-
-	Assert(BufferIsLockedByMeInMode(heap_buffer, BUFFER_LOCK_EXCLUSIVE));
-
-	/*
-	 * As of PostgreSQL 9.2, the visibility map bit should never be set if the
-	 * page-level bit is clear.  However, it's possible that the bit got
-	 * cleared after heap_vac_scan_next_block() was called, so we must recheck
-	 * with buffer lock before concluding that the VM is corrupt.
-	 */
-	if (!PageIsAllVisible(heap_page) &&
-		((*vmbits & VISIBILITYMAP_VALID_BITS) != 0))
-	{
-		ereport(WARNING,
-				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg("page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-						RelationGetRelationName(rel), heap_blk)));
-
-		visibilitymap_clear(rel, heap_blk, vmbuffer,
-							VISIBILITYMAP_VALID_BITS);
-		*vmbits = 0;
-	}
-
-	/*
-	 * It's possible for the value returned by
-	 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-	 * wrong for us to see tuples that appear to not be visible to everyone
-	 * yet, while PD_ALL_VISIBLE is already set. The real safe xmin value
-	 * never moves backwards, but GetOldestNonRemovableTransactionId() is
-	 * conservative and sometimes returns a value that's unnecessarily small,
-	 * so if we see that contradiction it just means that the tuples that we
-	 * think are not visible to everyone yet actually are, and the
-	 * PD_ALL_VISIBLE flag is correct.
-	 *
-	 * There should never be LP_DEAD items on a page with PD_ALL_VISIBLE set,
-	 * however.
-	 */
-	else if (PageIsAllVisible(heap_page) && nlpdead_items > 0)
-	{
-		ereport(WARNING,
-				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg("page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
-						RelationGetRelationName(rel), heap_blk)));
-
-		PageClearAllVisible(heap_page);
-		MarkBufferDirty(heap_buffer);
-		visibilitymap_clear(rel, heap_blk, vmbuffer,
-							VISIBILITYMAP_VALID_BITS);
-		*vmbits = 0;
-	}
-}
-
 /*
  *	lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
  *
@@ -2095,6 +2015,7 @@ lazy_scan_prune(LVRelState *vacrel,
 	PruneFreezeParams params = {
 		.relation = rel,
 		.buffer = buf,
+		.vmbuffer = vmbuffer,
 		.reason = PRUNE_VACUUM_SCAN,
 		.options = HEAP_PAGE_PRUNE_FREEZE,
 		.vistest = vacrel->vistest,
@@ -2204,18 +2125,12 @@ lazy_scan_prune(LVRelState *vacrel,
 	Assert(!presult.set_all_visible || !(*has_lpdead_items));
 	Assert(!presult.set_all_frozen || presult.set_all_visible);
 
-	old_vmbits = visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer);
-
-	identify_and_fix_vm_corruption(vacrel->rel, buf, blkno, page,
-								   presult.lpdead_items, vmbuffer,
-								   &old_vmbits);
-
 	if (!presult.set_all_visible)
 		return presult.ndeleted;
 
 	/* Set the visibility map and page visibility hint */
+	old_vmbits = presult.old_vmbits;
 	new_vmbits = VISIBILITYMAP_ALL_VISIBLE;
-
 	if (presult.set_all_frozen)
 		new_vmbits |= VISIBILITYMAP_ALL_FROZEN;
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 2fdc50b865b..00134012137 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -262,6 +262,12 @@ typedef struct PruneFreezeParams
 	Relation	relation;		/* relation containing buffer to be pruned */
 	Buffer		buffer;			/* buffer to be pruned */
 
+	/*
+	 * Callers should provide a pinned vmbuffer corresponding to the heap
+	 * block in buffer. We will check for and repair any corruption in the VM.
+	 */
+	Buffer		vmbuffer;
+
 	/*
 	 * The reason pruning was performed.  It is used to set the WAL record
 	 * opcode which is used for debugging and analysis purposes.
@@ -324,6 +330,12 @@ typedef struct PruneFreezeResult
 	bool		set_all_frozen;
 	TransactionId vm_conflict_horizon;
 
+	/*
+	 * The value of the vmbuffer's vmbits at the beginning of pruning. It is
+	 * cleared if VM corruption is found and corrected.
+	 */
+	uint8		old_vmbits;
+
 	/*
 	 * Whether or not the page makes rel truncation unsafe.  This is set to
 	 * 'true', even if the page contains LP_DEAD items.  VACUUM will remove
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a4a2ed07816..480614d483b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3273,6 +3273,7 @@ UserAuth
 UserContext
 UserMapping
 UserOpts
+VMCorruptionType
 VacAttrStats
 VacAttrStatsP
 VacDeadItemsInfo
-- 
2.43.0



  [text/x-patch] v41-0002-Add-pruning-fast-path-for-all-visible-and-all-fr.patch (7.7K, 3-v41-0002-Add-pruning-fast-path-for-all-visible-and-all-fr.patch)
  download | inline diff:
From 0c9f91eec0127bc914c7fbe79256c6e5b689cde8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 25 Feb 2026 16:48:19 -0500
Subject: [PATCH v41 02/12] Add pruning fast path for all-visible and
 all-frozen pages

Because of the SKIP_PAGES_THRESHOLD optimization or a stale prune XID,
heap_page_prune_and_freeze() can be invoked for pages with no pruning or
freezing work. To avoid this, if a page is already all-frozen or it is
all-visible and no freezing will be attempted, we exit early. We can't
exit early if vacuum passed DISABLE_PAGE_SKIPPING, though.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
---
 src/backend/access/heap/pruneheap.c  | 97 +++++++++++++++++++++++++++-
 src/backend/access/heap/vacuumlazy.c | 10 +++
 src/include/access/heapam.h          |  1 +
 3 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index e452d25cae6..19a72ac6b27 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -129,6 +129,12 @@ typedef struct
 	/* Bits in the vmbuffer for this heap page */
 	uint8		old_vmbits;
 
+	/*
+	 * True if the page can bypass full page inspection during pruning and
+	 * freezing based on its visibility map status and the caller's options.
+	 */
+	bool		fast_path;
+
 	/*-------------------------------------------------------
 	 * Information about what was done
 	 *
@@ -201,6 +207,7 @@ static void prune_freeze_setup(PruneFreezeParams *params,
 							   PruneState *prstate);
 static void heap_page_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum,
 										VMCorruptionType ctype);
+static void prune_freeze_bypass(PruneState *prstate, PruneFreezeResult *presult);
 static void prune_freeze_plan(PruneState *prstate,
 							  OffsetNumber *off_loc);
 static HTSV_Result heap_prune_satisfies_vacuum(PruneState *prstate,
@@ -329,7 +336,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
 			 * cannot safely determine that during on-access pruning with the
 			 * current implementation.
 			 */
-			params.options = 0;
+			params.options = HEAP_PAGE_PRUNE_ALLOW_FAST_PATH;
 
 			heap_page_prune_and_freeze(&params, &presult, &dummy_off_loc,
 									   NULL, NULL);
@@ -398,6 +405,16 @@ prune_freeze_setup(PruneFreezeParams *params,
 												   prstate->block,
 												   &prstate->vmbuffer);
 
+	/*
+	 * If the page is already all-frozen, or already all-visible when freezing
+	 * is not being attempted, we can skip pruning and freezing entirely.
+	 * Callers must opt in by setting HEAP_PAGE_PRUNE_ALLOW_FAST_PATH.
+	 */
+	prstate->fast_path = ((prstate->old_vmbits & VISIBILITYMAP_ALL_FROZEN) ||
+						  ((prstate->old_vmbits & VISIBILITYMAP_ALL_VISIBLE) &&
+						   !prstate->attempt_freeze)) &&
+		(params->options & HEAP_PAGE_PRUNE_ALLOW_FAST_PATH);
+
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
 	 * then apply the changes within a critical section.  This keeps as much
@@ -913,6 +930,73 @@ heap_page_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum,
 	prstate->old_vmbits = 0;
 }
 
+/*
+ * If the page is already all-frozen, or already all-visible and freezing
+ * is not being attempted, there is no remaining work and we can bypass the
+ * expensive overhead of heap_page_prune_and_freeze().
+ *
+ * This can happen when the page has a stale prune hint, or if VACUUM is
+ * scanning an already all-frozen page due to SKIP_PAGES_THRESHOLD.
+ *
+ * The caller must already have examined the visibility map and saved the
+ * status for the page's VM bits in prstate->old_vmbits. Caller must hold a
+ * content lock on the heap page since it will examine line pointers.
+ *
+ * Before calling prune_freeze_bypass(), the caller should first
+ * check for and fix any discrepancy between the page-level visibility hint
+ * and the visibility map. Otherwise, the fast path will always prevent us
+ * from getting them in sync. Note that if there are tuples on the page that
+ * are not visible to all but the VM is incorrectly marked
+ * all-visible/all-frozen, we will not get the chance to fix that corruption
+ * when using the fast path.
+ */
+static void
+prune_freeze_bypass(PruneState *prstate, PruneFreezeResult *presult)
+{
+	OffsetNumber maxoff = PageGetMaxOffsetNumber(prstate->page);
+	Page		page = prstate->page;
+
+	Assert(prstate->old_vmbits & VISIBILITYMAP_ALL_FROZEN ||
+		   (prstate->old_vmbits & VISIBILITYMAP_ALL_VISIBLE &&
+			!prstate->attempt_freeze));
+
+	/* We'll fill in presult for the caller */
+	memset(presult, 0, sizeof(PruneFreezeResult));
+
+	presult->old_vmbits = prstate->old_vmbits;
+
+	/* Clear any stale prune hint */
+	if (TransactionIdIsValid(PageGetPruneXid(page)))
+	{
+		PageClearPrunable(page);
+		MarkBufferDirtyHint(prstate->buffer, true);
+	}
+
+	if (PageIsEmpty(page))
+		return;
+
+	/*
+	 * Since the page is all-visible, a count of the normal ItemIds on the
+	 * page should be sufficient for vacuum's live tuple count.
+	 */
+	for (OffsetNumber off = FirstOffsetNumber;
+		 off <= maxoff;
+		 off = OffsetNumberNext(off))
+	{
+		if (ItemIdIsNormal(PageGetItemId(page, off)))
+		{
+			/*
+			 * Now that we've found an actual tuple, set hastup. If the page
+			 * is entirely LP_UNUSED, we want vacuum to still truncate it.
+			 */
+			presult->hastup = true;
+			prstate->live_tuples++;
+		}
+	}
+
+	presult->live_tuples = prstate->live_tuples;
+}
+
 /*
  * Prune and repair fragmentation and potentially freeze tuples on the
  * specified page.
@@ -982,6 +1066,17 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 		heap_page_fix_vm_corruption(&prstate, InvalidOffsetNumber,
 									VM_CORRUPT_MISSING_PAGE_HINT);
 
+	/*
+	 * If the visibility map status allows it, bypass pruning and freezing
+	 * entirely. This must be done after fixing any discrepancy between the
+	 * page-level visibility hint and the VM.
+	 */
+	if (prstate.fast_path)
+	{
+		prune_freeze_bypass(&prstate, presult);
+		return;
+	}
+
 	/*
 	 * Examine all line pointers and tuple visibility information to determine
 	 * which line pointers should change state and which tuples may be frozen.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 56722556417..1a446050d85 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2044,6 +2044,16 @@ lazy_scan_prune(LVRelState *vacrel,
 	if (vacrel->nindexes == 0)
 		params.options |= HEAP_PAGE_PRUNE_MARK_UNUSED_NOW;
 
+	/*
+	 * Allow skipping full inspection of pages that the VM indicates are
+	 * already all-frozen (which may be scanned due to SKIP_PAGES_THRESHOLD).
+	 * However, if DISABLE_PAGE_SKIPPING was specified, we can't trust the VM,
+	 * so we must examine the page to make sure it is truly all-frozen and fix
+	 * it otherwise.
+	 */
+	if (vacrel->skipwithvm)
+		params.options |= HEAP_PAGE_PRUNE_ALLOW_FAST_PATH;
+
 	heap_page_prune_and_freeze(&params,
 							   &presult,
 							   &vacrel->offnum,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 00134012137..305ecc31a9e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,7 @@
 /* "options" flag bits for heap_page_prune_and_freeze */
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
+#define HEAP_PAGE_PRUNE_ALLOW_FAST_PATH		(1 << 2)
 
 typedef struct BulkInsertStateData *BulkInsertState;
 typedef struct GlobalVisState GlobalVisState;
-- 
2.43.0



  [text/x-patch] v41-0003-Use-GlobalVisState-in-vacuum-to-determine-page-l.patch (12.3K, 4-v41-0003-Use-GlobalVisState-in-vacuum-to-determine-page-l.patch)
  download | inline diff:
From 933ccfc1fa4f652c9a9f0be7cac5abebf4ddf7c1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 17 Dec 2025 16:51:05 -0500
Subject: [PATCH v41 03/12] Use GlobalVisState in vacuum to determine page
 level visibility

During vacuum's first and third phases, we examine tuples' visibility
to determine if we can set the page all-visible in the visibility map.

Previously, this check compared tuple xmins against a single XID chosen at
the start of vacuum (OldestXmin). We now use GlobalVisState, which also
enables future work to set the VM during on-access pruning, since ordinary
queries have access to GlobalVisState but not OldestXmin.

This also benefits vacuum: in some cases, GlobalVisState may advance
during a vacuum, allowing more pages to become considered all-visible.
And, in the future, we could easily add a heuristic to update
GlobalVisState more frequently during vacuums of large tables.

OldestXmin is still used for freezing and as a backstop to ensure we
don't freeze a dead tuple that wasn't yet prunable according to
GlobalVisState in the rare occurrences where GlobalVisState moves
backwards.

Because comparing a transaction ID against GlobalVisState is more
expensive than comparing against a single XID, we defer this check until
after scanning all tuples on the page. Therefore, we perform the
GlobalVisState check only once per page. This is safe because
visibility_cutoff_xid records the newest live xmin on the page;
if it is globally visible, then the entire page is all-visible.

Using GlobalVisState means on-access pruning can also maintain
visibility_cutoff_xid. This approach will result in examining more tuple
xmins than before; however, the additional cost should not be
significant. And doing so will enable us to set the visibility map on
access in the future.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Discussion: https://postgr.es/m/flat/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk#c755ef151507aba58471ffaca607e493
---
 src/backend/access/heap/heapam_visibility.c | 22 ++++++++++
 src/backend/access/heap/pruneheap.c         | 48 ++++++++++-----------
 src/backend/access/heap/vacuumlazy.c        | 48 +++++++++++++--------
 src/include/access/heapam.h                 |  2 +
 4 files changed, 79 insertions(+), 41 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index fc64f4343ce..d70fab3a763 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1131,6 +1131,28 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 	return res;
 }
 
+/*
+ * Wrapper around GlobalVisTestIsRemovableXid() for use when examining live
+ * tuples. Returns true if the given XID may be considered running by at least
+ * one snapshot.
+ *
+ * This function alone is insufficient to determine tuple visibility; callers
+ * must also consider the XID's commit status. Its purpose is purely semantic:
+ * when applied to live tuples, GlobalVisTestIsRemovableXid() is checking
+ * whether the inserting transaction is still considered running, not whether
+ * the tuple is removable. Live tuples are, by definition, not removable, but
+ * the snapshot criteria for “transaction still running” are identical to
+ * those used for removal XIDs.
+ *
+ * See the comment above GlobalVisTestIsRemovable[Full]Xid() for details on the
+ * required preconditions for calling this function.
+ */
+bool
+GlobalVisTestXidMaybeRunning(GlobalVisState *state, TransactionId xid)
+{
+	return !GlobalVisTestIsRemovableXid(state, xid);
+}
+
 /*
  * Work horse for HeapTupleSatisfiesVacuum and similar routines.
  *
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 19a72ac6b27..f437579076e 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -166,10 +166,13 @@ typedef struct
 	 * all-frozen bits in the visibility map can be set for this page after
 	 * pruning.
 	 *
-	 * visibility_cutoff_xid is the newest xmin of live tuples on the page.
-	 * The caller can use it as the conflict horizon, when setting the VM
-	 * bits.  It is only valid if we froze some tuples, and set_all_frozen is
-	 * true.
+	 * visibility_cutoff_xid is the newest xmin of live tuples on the page. It
+	 * is used after processing all tuples to determine if the page can be
+	 * considered all-visible (if the newest xmin is still considered running
+	 * by some snapshot, it cannot be). It is also used by the caller as the
+	 * conflict horizon when setting the VM bits, unless we froze all tuples
+	 * on the page (in which case the conflict xid was already included in the
+	 * WAL record).
 	 *
 	 * NOTE: set_all_visible and set_all_frozen initially don't include
 	 * LP_DEAD items. That's convenient for heap_page_prune_and_freeze() to
@@ -1085,6 +1088,17 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 	 */
 	prune_freeze_plan(&prstate, off_loc);
 
+	/*
+	 * After processing all the live tuples on the page, if the newest xmin
+	 * amongst them may be considered running by any snapshot, the page cannot
+	 * be all-visible.
+	 */
+	if (prstate.set_all_visible &&
+		TransactionIdIsNormal(prstate.visibility_cutoff_xid) &&
+		GlobalVisTestXidMaybeRunning(prstate.vistest,
+									 prstate.visibility_cutoff_xid))
+		prstate.set_all_visible = prstate.set_all_frozen = false;
+
 	/*
 	 * If checksums are enabled, calling heap_prune_satisfies_vacuum() while
 	 * checking tuple visibility information in prune_freeze_plan() may have
@@ -1753,29 +1767,15 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
 				}
 
 				/*
-				 * The inserter definitely committed.  But is it old enough
-				 * that everyone sees it as committed?  A FrozenTransactionId
-				 * is seen as committed to everyone.  Otherwise, we check if
-				 * there is a snapshot that considers this xid to still be
-				 * running, and if so, we don't consider the page all-visible.
+				 * The inserter definitely committed. But we don't know if it
+				 * is old enough that everyone sees it as committed. Later,
+				 * after processing all the tuples on the page, we'll check if
+				 * there is any snapshot that still considers the newest xid
+				 * on the page to be running. If so, we don't consider the
+				 * page all-visible.
 				 */
 				xmin = HeapTupleHeaderGetXmin(htup);
 
-				/*
-				 * For now always use prstate->cutoffs for this test, because
-				 * we only update 'set_all_visible' and 'set_all_frozen' when
-				 * freezing is requested. We could use
-				 * GlobalVisTestIsRemovableXid instead, if a non-freezing
-				 * caller wanted to set the VM bit.
-				 */
-				Assert(prstate->cutoffs);
-				if (!TransactionIdPrecedes(xmin, prstate->cutoffs->OldestXmin))
-				{
-					prstate->set_all_visible = false;
-					prstate->set_all_frozen = false;
-					break;
-				}
-
 				/* Track newest xmin on page. */
 				if (TransactionIdFollows(xmin, prstate->visibility_cutoff_xid) &&
 					TransactionIdIsNormal(xmin))
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1a446050d85..2a94ba3a387 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -468,13 +468,13 @@ static void dead_items_cleanup(LVRelState *vacrel);
 
 #ifdef USE_ASSERT_CHECKING
 static bool heap_page_is_all_visible(Relation rel, Buffer buf,
-									 TransactionId OldestXmin,
+									 GlobalVisState *vistest,
 									 bool *all_frozen,
 									 TransactionId *visibility_cutoff_xid,
 									 OffsetNumber *logging_offnum);
 #endif
 static bool heap_page_would_be_all_visible(Relation rel, Buffer buf,
-										   TransactionId OldestXmin,
+										   GlobalVisState *vistest,
 										   OffsetNumber *deadoffsets,
 										   int ndeadoffsets,
 										   bool *all_frozen,
@@ -2089,7 +2089,7 @@ lazy_scan_prune(LVRelState *vacrel,
 		Assert(presult.lpdead_items == 0);
 
 		Assert(heap_page_is_all_visible(vacrel->rel, buf,
-										vacrel->cutoffs.OldestXmin, &debug_all_frozen,
+										vacrel->vistest, &debug_all_frozen,
 										&debug_cutoff, &vacrel->offnum));
 
 		Assert(presult.set_all_frozen == debug_all_frozen);
@@ -2852,7 +2852,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	 * done outside the critical section.
 	 */
 	if (heap_page_would_be_all_visible(vacrel->rel, buffer,
-									   vacrel->cutoffs.OldestXmin,
+									   vacrel->vistest,
 									   deadoffsets, num_offsets,
 									   &all_frozen, &visibility_cutoff_xid,
 									   &vacrel->offnum))
@@ -3614,14 +3614,14 @@ dead_items_cleanup(LVRelState *vacrel)
  */
 static bool
 heap_page_is_all_visible(Relation rel, Buffer buf,
-						 TransactionId OldestXmin,
+						 GlobalVisState *vistest,
 						 bool *all_frozen,
 						 TransactionId *visibility_cutoff_xid,
 						 OffsetNumber *logging_offnum)
 {
 
 	return heap_page_would_be_all_visible(rel, buf,
-										  OldestXmin,
+										  vistest,
 										  NULL, 0,
 										  all_frozen,
 										  visibility_cutoff_xid,
@@ -3642,7 +3642,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
  * Returns true if the page is all-visible other than the provided
  * deadoffsets and false otherwise.
  *
- * OldestXmin is used to determine visibility.
+ * vistest is used to determine visibility.
  *
  * Output parameters:
  *
@@ -3661,7 +3661,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
  */
 static bool
 heap_page_would_be_all_visible(Relation rel, Buffer buf,
-							   TransactionId OldestXmin,
+							   GlobalVisState *vistest,
 							   OffsetNumber *deadoffsets,
 							   int ndeadoffsets,
 							   bool *all_frozen,
@@ -3742,7 +3742,7 @@ heap_page_would_be_all_visible(Relation rel, Buffer buf,
 				{
 					TransactionId xmin;
 
-					/* Check comments in lazy_scan_prune. */
+					/* Check heap_prune_record_unchanged_lp_normal comments */
 					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
 					{
 						all_visible = false;
@@ -3751,16 +3751,17 @@ heap_page_would_be_all_visible(Relation rel, Buffer buf,
 					}
 
 					/*
-					 * The inserter definitely committed. But is it old enough
-					 * that everyone sees it as committed?
+					 * The inserter definitely committed. But we don't know if
+					 * it is old enough that everyone sees it as committed.
+					 * Don't check that now.
+					 *
+					 * If we scan all tuples without finding one that prevents
+					 * the page from being all-visible, we then check whether
+					 * any snapshot still considers the newest XID on the page
+					 * to be running. In that case, the page is not considered
+					 * all-visible.
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, OldestXmin))
-					{
-						all_visible = false;
-						*all_frozen = false;
-						break;
-					}
 
 					/* Track newest xmin on page. */
 					if (TransactionIdFollows(xmin, *visibility_cutoff_xid) &&
@@ -3789,6 +3790,19 @@ heap_page_would_be_all_visible(Relation rel, Buffer buf,
 		}
 	}							/* scan along page */
 
+	/*
+	 * After processing all the live tuples on the page, if the newest xmin
+	 * among them may still be considered running by any snapshot, the page
+	 * cannot be all-visible.
+	 */
+	if (all_visible &&
+		TransactionIdIsNormal(*visibility_cutoff_xid) &&
+		GlobalVisTestXidMaybeRunning(vistest, *visibility_cutoff_xid))
+	{
+		all_visible = false;
+		*all_frozen = false;
+	}
+
 	/* Clear the offset information once we have processed the given page. */
 	*logging_offnum = InvalidOffsetNumber;
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 305ecc31a9e..f9dbd70c1c4 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -480,6 +480,8 @@ extern TM_Result HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
 										  Buffer buffer);
 extern HTSV_Result HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 											Buffer buffer);
+
+extern bool GlobalVisTestXidMaybeRunning(GlobalVisState *state, TransactionId xid);
 extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer,
 												   TransactionId *dead_after);
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
-- 
2.43.0



  [text/x-patch] v41-0004-Keep-newest-live-XID-up-to-date-even-if-page-not.patch (15.4K, 5-v41-0004-Keep-newest-live-XID-up-to-date-even-if-page-not.patch)
  download | inline diff:
From de07645b084c2e01050ac5fa8a6c80240842673e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Sat, 28 Feb 2026 16:06:51 -0500
Subject: [PATCH v41 04/12] Keep newest live XID up-to-date even if page not
 all-visible

During pruning, we keep track of the newest xmin of live tuples on the
page visible to all running and future transactions so that we can use
it later as the snapshot conflict horizon when setting the VM if the
page turns out to be all-visible.

Previously, we stopped updating this value once we determined the page
was not all-visible. However, maintaining it even when the page is not
all-visible is inexpensive and makes the snapshot conflict horizon
calculation clearer. This guarantees it won't contain a stale value.

Since we'll keep it up to date all the time now anyway, there's no
reason not to maintain all_visible for on-access pruning. This will
allow us to set the VM on-access in the future.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
---
 src/backend/access/heap/pruneheap.c  | 137 +++++++++++----------------
 src/backend/access/heap/vacuumlazy.c |  30 +++---
 2 files changed, 72 insertions(+), 95 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f437579076e..9451e9417f7 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -129,6 +129,9 @@ typedef struct
 	/* Bits in the vmbuffer for this heap page */
 	uint8		old_vmbits;
 
+	/* The newest xmin of live tuples on the page */
+	TransactionId newest_live_xid;
+
 	/*
 	 * True if the page can bypass full page inspection during pruning and
 	 * freezing based on its visibility map status and the caller's options.
@@ -166,14 +169,6 @@ typedef struct
 	 * all-frozen bits in the visibility map can be set for this page after
 	 * pruning.
 	 *
-	 * visibility_cutoff_xid is the newest xmin of live tuples on the page. It
-	 * is used after processing all tuples to determine if the page can be
-	 * considered all-visible (if the newest xmin is still considered running
-	 * by some snapshot, it cannot be). It is also used by the caller as the
-	 * conflict horizon when setting the VM bits, unless we froze all tuples
-	 * on the page (in which case the conflict xid was already included in the
-	 * WAL record).
-	 *
 	 * NOTE: set_all_visible and set_all_frozen initially don't include
 	 * LP_DEAD items. That's convenient for heap_page_prune_and_freeze() to
 	 * use them to decide whether to freeze the page or not.  The
@@ -183,7 +178,6 @@ typedef struct
 	 */
 	bool		set_all_visible;
 	bool		set_all_frozen;
-	TransactionId visibility_cutoff_xid;
 } PruneState;
 
 
@@ -471,53 +465,42 @@ prune_freeze_setup(PruneFreezeParams *params,
 	prstate->deadoffsets = presult->deadoffsets;
 
 	/*
-	 * Vacuum may update the VM after we're done.  We can keep track of
-	 * whether the page will be all-visible and all-frozen after pruning and
-	 * freezing to help the caller to do that.
-	 *
-	 * Currently, only VACUUM sets the VM bits.  To save the effort, only do
-	 * the bookkeeping if the caller needs it.  Currently, that's tied to
-	 * HEAP_PAGE_PRUNE_FREEZE, but it could be a separate flag if you wanted
-	 * to update the VM bits without also freezing or freeze without also
-	 * setting the VM bits.
+	 * We track whether the page will be all-visible/all-frozen at the end of
+	 * pruning and freezing. While examining tuple visibility, we'll set
+	 * set_all_visible to false if there are tuples on the page not visible to
+	 * all running and future transactions. set_all_visible is always
+	 * maintained but only VACUUM will set the VM if the page ends up being
+	 * all-visible.
 	 *
-	 * In addition to telling the caller whether it can set the VM bit, we
-	 * also use 'set_all_visible' and 'set_all_frozen' for our own
-	 * decision-making. If the whole page would become frozen, we consider
-	 * opportunistically freezing tuples.  We will not be able to freeze the
-	 * whole page if there are tuples present that are not visible to everyone
-	 * or if there are dead tuples which are not yet removable.  However, dead
-	 * tuples which will be removed by the end of vacuuming should not
-	 * preclude us from opportunistically freezing.  Because of that, we do
-	 * not immediately clear set_all_visible and set_all_frozen when we see
-	 * LP_DEAD items.  We fix that after scanning the line pointers. We must
-	 * correct set_all_visible and set_all_frozen before we return them to the
-	 * caller, so that the caller doesn't set the VM bits incorrectly.
+	 * We also keep track of the newest live XID, which is used to calculate
+	 * the snapshot conflict horizon for a WAL record setting the VM.
 	 */
-	if (prstate->attempt_freeze)
-	{
-		prstate->set_all_visible = true;
-		prstate->set_all_frozen = true;
-	}
-	else
-	{
-		/*
-		 * Initializing to false allows skipping the work to update them in
-		 * heap_prune_record_unchanged_lp_normal().
-		 */
-		prstate->set_all_visible = false;
-		prstate->set_all_frozen = false;
-	}
+	prstate->set_all_visible = true;
+	prstate->newest_live_xid = InvalidTransactionId;
 
 	/*
-	 * The visibility cutoff xid is the newest xmin of live tuples on the
-	 * page.  In the common case, this will be set as the conflict horizon the
-	 * caller can use for updating the VM.  If, at the end of freezing and
-	 * pruning, the page is all-frozen, there is no possibility that any
-	 * running transaction on the standby does not see tuples on the page as
-	 * all-visible, so the conflict horizon remains InvalidTransactionId.
+	 * Currently, only VACUUM performs freezing, but other callers may in the
+	 * future. We must initialize set_all_frozen based on whether or not the
+	 * caller passed HEAP_PAGE_PRUNE_FREEZE, because if they did not, we won't
+	 * call heap_prepare_freeze_tuple() for each tuple, and set_all_frozen
+	 * will never be cleared for tuples that need freezing.
+	 *
+	 * When freezing is not required (no XIDs/MXIDs older than the freeze
+	 * cutoff), we may still choose to "opportunistically" freeze if doing so
+	 * would make the page all-frozen.
+	 *
+	 * We will not be able to freeze the whole page at the end of vacuum if
+	 * there are tuples present that are not visible to everyone or if there
+	 * are dead tuples which will not be removable. However, dead tuples that
+	 * will be removed by the end of vacuum should not prevent this
+	 * opportunistic freezing.
+	 *
+	 * Therefore, we do not clear set_all_visible and set_all_frozen when we
+	 * encounter LP_DEAD items. Instead, we correct them after deciding
+	 * whether to freeze, but before updating the VM, to avoid setting the VM
+	 * bits incorrectly.
 	 */
-	prstate->visibility_cutoff_xid = InvalidTransactionId;
+	prstate->set_all_frozen = prstate->attempt_freeze;
 }
 
 /*
@@ -747,7 +730,6 @@ heap_page_will_freeze(bool did_tuple_hint_fpi,
 	if (!prstate->attempt_freeze)
 	{
 		Assert(!prstate->set_all_frozen && prstate->nfrozen == 0);
-		Assert(prstate->lpdead_items == 0 || !prstate->set_all_visible);
 		return false;
 	}
 
@@ -1021,9 +1003,8 @@ prune_freeze_bypass(PruneState *prstate, PruneFreezeResult *presult)
  * HEAP_PAGE_PRUNE_FREEZE option is passed, we also set
  * presult->set_all_visible and presult->set_all_frozen after determining
  * whether or not to opportunistically freeze, to indicate if the VM bits can
- * be set.  They are always set to false when the HEAP_PAGE_PRUNE_FREEZE
- * option is not passed, because at the moment only callers that also freeze
- * need that information.
+ * be set. 'all-frozen' is always set to false when the HEAP_PAGE_PRUNE_FREEZE
+ * option is not passed.
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
@@ -1094,9 +1075,9 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 	 * be all-visible.
 	 */
 	if (prstate.set_all_visible &&
-		TransactionIdIsNormal(prstate.visibility_cutoff_xid) &&
+		TransactionIdIsNormal(prstate.newest_live_xid) &&
 		GlobalVisTestXidMaybeRunning(prstate.vistest,
-									 prstate.visibility_cutoff_xid))
+									 prstate.newest_live_xid))
 		prstate.set_all_visible = prstate.set_all_frozen = false;
 
 	/*
@@ -1247,7 +1228,7 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 	if (presult->set_all_frozen)
 		presult->vm_conflict_horizon = InvalidTransactionId;
 	else
-		presult->vm_conflict_horizon = prstate.visibility_cutoff_xid;
+		presult->vm_conflict_horizon = prstate.newest_live_xid;
 
 	presult->lpdead_items = prstate.lpdead_items;
 	/* the presult->deadoffsets array was already filled in */
@@ -1708,6 +1689,7 @@ static void
 heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
 {
 	HeapTupleHeader htup;
+	TransactionId xmin;
 	Page		page = prstate->page;
 
 	Assert(!prstate->processed[offnum]);
@@ -1755,32 +1737,27 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
 			 * See SetHintBits for more info.  Check that the tuple is hinted
 			 * xmin-committed because of that.
 			 */
-			if (prstate->set_all_visible)
+			if (!HeapTupleHeaderXminCommitted(htup))
 			{
-				TransactionId xmin;
+				prstate->set_all_visible = false;
+				prstate->set_all_frozen = false;
+				break;
+			}
 
-				if (!HeapTupleHeaderXminCommitted(htup))
-				{
-					prstate->set_all_visible = false;
-					prstate->set_all_frozen = false;
-					break;
-				}
+			/*
+			 * The inserter definitely committed. But we don't know if it is
+			 * old enough that everyone sees it as committed. Later, after
+			 * processing all the tuples on the page, we'll check if there is
+			 * any snapshot that still considers the newest xid on the page to
+			 * be running. If so, we don't consider the page all-visible.
+			 */
+			xmin = HeapTupleHeaderGetXmin(htup);
 
-				/*
-				 * The inserter definitely committed. But we don't know if it
-				 * is old enough that everyone sees it as committed. Later,
-				 * after processing all the tuples on the page, we'll check if
-				 * there is any snapshot that still considers the newest xid
-				 * on the page to be running. If so, we don't consider the
-				 * page all-visible.
-				 */
-				xmin = HeapTupleHeaderGetXmin(htup);
+			/* Track newest xmin on page. */
+			if (TransactionIdFollows(xmin, prstate->newest_live_xid) &&
+				TransactionIdIsNormal(xmin))
+				prstate->newest_live_xid = xmin;
 
-				/* Track newest xmin on page. */
-				if (TransactionIdFollows(xmin, prstate->visibility_cutoff_xid) &&
-					TransactionIdIsNormal(xmin))
-					prstate->visibility_cutoff_xid = xmin;
-			}
 			break;
 
 		case HEAPTUPLE_RECENTLY_DEAD:
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2a94ba3a387..8599dd7fcfa 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -470,7 +470,7 @@ static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(Relation rel, Buffer buf,
 									 GlobalVisState *vistest,
 									 bool *all_frozen,
-									 TransactionId *visibility_cutoff_xid,
+									 TransactionId *newest_live_xid,
 									 OffsetNumber *logging_offnum);
 #endif
 static bool heap_page_would_be_all_visible(Relation rel, Buffer buf,
@@ -478,7 +478,7 @@ static bool heap_page_would_be_all_visible(Relation rel, Buffer buf,
 										   OffsetNumber *deadoffsets,
 										   int ndeadoffsets,
 										   bool *all_frozen,
-										   TransactionId *visibility_cutoff_xid,
+										   TransactionId *newest_live_xid,
 										   OffsetNumber *logging_offnum);
 static void update_relstats_all_indexes(LVRelState *vacrel);
 static void vacuum_error_callback(void *arg);
@@ -2828,7 +2828,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	Page		page = BufferGetPage(buffer);
 	OffsetNumber unused[MaxHeapTuplesPerPage];
 	int			nunused = 0;
-	TransactionId visibility_cutoff_xid;
+	TransactionId newest_live_xid;
 	TransactionId conflict_xid = InvalidTransactionId;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
@@ -2854,14 +2854,14 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	if (heap_page_would_be_all_visible(vacrel->rel, buffer,
 									   vacrel->vistest,
 									   deadoffsets, num_offsets,
-									   &all_frozen, &visibility_cutoff_xid,
+									   &all_frozen, &newest_live_xid,
 									   &vacrel->offnum))
 	{
 		vmflags |= VISIBILITYMAP_ALL_VISIBLE;
 		if (all_frozen)
 		{
 			vmflags |= VISIBILITYMAP_ALL_FROZEN;
-			Assert(!TransactionIdIsValid(visibility_cutoff_xid));
+			Assert(!TransactionIdIsValid(newest_live_xid));
 		}
 
 		/*
@@ -2902,7 +2902,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 		visibilitymap_set_vmbits(blkno,
 								 vmbuffer, vmflags,
 								 vacrel->rel->rd_locator);
-		conflict_xid = visibility_cutoff_xid;
+		conflict_xid = newest_live_xid;
 	}
 
 	/*
@@ -3616,7 +3616,7 @@ static bool
 heap_page_is_all_visible(Relation rel, Buffer buf,
 						 GlobalVisState *vistest,
 						 bool *all_frozen,
-						 TransactionId *visibility_cutoff_xid,
+						 TransactionId *newest_live_xid,
 						 OffsetNumber *logging_offnum)
 {
 
@@ -3624,7 +3624,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 										  vistest,
 										  NULL, 0,
 										  all_frozen,
-										  visibility_cutoff_xid,
+										  newest_live_xid,
 										  logging_offnum);
 }
 #endif
@@ -3647,7 +3647,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
  * Output parameters:
  *
  *  - *all_frozen: true if every tuple on the page is frozen
- *  - *visibility_cutoff_xid: newest xmin; valid only if page is all-visible
+ *  - *newest_live_xid: newest xmin of live tuples on the page
  *  - *logging_offnum: OffsetNumber of current tuple being processed;
  *     used by vacuum's error callback system.
  *
@@ -3665,7 +3665,7 @@ heap_page_would_be_all_visible(Relation rel, Buffer buf,
 							   OffsetNumber *deadoffsets,
 							   int ndeadoffsets,
 							   bool *all_frozen,
-							   TransactionId *visibility_cutoff_xid,
+							   TransactionId *newest_live_xid,
 							   OffsetNumber *logging_offnum)
 {
 	Page		page = BufferGetPage(buf);
@@ -3675,7 +3675,7 @@ heap_page_would_be_all_visible(Relation rel, Buffer buf,
 	bool		all_visible = true;
 	int			matched_dead_count = 0;
 
-	*visibility_cutoff_xid = InvalidTransactionId;
+	*newest_live_xid = InvalidTransactionId;
 	*all_frozen = true;
 
 	Assert(ndeadoffsets == 0 || deadoffsets);
@@ -3764,9 +3764,9 @@ heap_page_would_be_all_visible(Relation rel, Buffer buf,
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
 
 					/* Track newest xmin on page. */
-					if (TransactionIdFollows(xmin, *visibility_cutoff_xid) &&
+					if (TransactionIdFollows(xmin, *newest_live_xid) &&
 						TransactionIdIsNormal(xmin))
-						*visibility_cutoff_xid = xmin;
+						*newest_live_xid = xmin;
 
 					/* Check whether this tuple is already frozen or not */
 					if (all_visible && *all_frozen &&
@@ -3796,8 +3796,8 @@ heap_page_would_be_all_visible(Relation rel, Buffer buf,
 	 * cannot be all-visible.
 	 */
 	if (all_visible &&
-		TransactionIdIsNormal(*visibility_cutoff_xid) &&
-		GlobalVisTestXidMaybeRunning(vistest, *visibility_cutoff_xid))
+		TransactionIdIsNormal(*newest_live_xid) &&
+		GlobalVisTestXidMaybeRunning(vistest, *newest_live_xid))
 	{
 		all_visible = false;
 		*all_frozen = false;
-- 
2.43.0



  [text/x-patch] v41-0005-WAL-log-VM-setting-during-vacuum-phase-I-in-XLOG.patch (23.1K, 6-v41-0005-WAL-log-VM-setting-during-vacuum-phase-I-in-XLOG.patch)
  download | inline diff:
From b6975991b391b979bbffac9cc0bd8896ba181ba8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 2 Dec 2025 15:07:42 -0500
Subject: [PATCH v41 05/12] WAL log VM setting during vacuum phase I in
 XLOG_HEAP2_PRUNE_VACUUM_SCAN

Vacuum no longer emits a separate WAL record for each page set
all-visible or all-frozen during phase I. Instead, visibility map
updates are now included in the XLOG_HEAP2_PRUNE_VACUUM_SCAN record that
is already emitted for pruning and freezing.

Previously, heap_page_prune_and_freeze() determined whether a page was
all-visible, but the corresponding VM bits were only set later in
lazy_scan_prune(). Now the VM is updated immediately in
heap_page_prune_and_freeze(), at the same time as the heap
modifications.

For now, this change applies only to vacuum phase I, not to pruning
performed during normal page access.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Robert Haas <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
---
 src/backend/access/heap/pruneheap.c  | 244 +++++++++++++++++++--------
 src/backend/access/heap/vacuumlazy.c | 113 ++-----------
 src/include/access/heapam.h          |  37 ++--
 3 files changed, 204 insertions(+), 190 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 9451e9417f7..dd8ac173ca1 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -72,6 +72,21 @@ typedef struct
 	OffsetNumber nowunused[MaxHeapTuplesPerPage];
 	HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
 
+	/*
+	 * set_all_visible and set_all_frozen indicate if the all-visible and
+	 * all-frozen bits in the visibility map can be set for this page after
+	 * pruning.
+	 *
+	 * NOTE: set_all_visible and set_all_frozen initially don't include
+	 * LP_DEAD items. That's convenient for heap_page_prune_and_freeze() to
+	 * use them to decide whether to opportunistically freeze the page or not.
+	 * The set_all_visible and set_all_frozen values ultimately used to set
+	 * the VM are adjusted to include LP_DEAD items after we determine whether
+	 * or not to opportunistically freeze.
+	 */
+	bool		set_all_visible;
+	bool		set_all_frozen;
+
 	/*-------------------------------------------------------
 	 * Working state for HOT chain processing
 	 *-------------------------------------------------------
@@ -122,12 +137,16 @@ typedef struct
 	/*
 	 * Caller must provide a pinned vmbuffer corresponding to the heap block
 	 * passed to heap_page_prune_and_freeze(). We will fix any corruption
-	 * found in the VM.
+	 * found in the VM and set the VM if the page is all-visible/all-frozen.
 	 */
 	Buffer		vmbuffer;
 
-	/* Bits in the vmbuffer for this heap page */
+	/*
+	 * The state of the VM bits at the beginning of pruning and the state they
+	 * will be in at the end.
+	 */
 	uint8		old_vmbits;
+	uint8		new_vmbits;
 
 	/* The newest xmin of live tuples on the page */
 	TransactionId newest_live_xid;
@@ -163,21 +182,6 @@ typedef struct
 	 */
 	int			lpdead_items;	/* number of items in the array */
 	OffsetNumber *deadoffsets;	/* points directly to presult->deadoffsets */
-
-	/*
-	 * set_all_visible and set_all_frozen indicate if the all-visible and
-	 * all-frozen bits in the visibility map can be set for this page after
-	 * pruning.
-	 *
-	 * NOTE: set_all_visible and set_all_frozen initially don't include
-	 * LP_DEAD items. That's convenient for heap_page_prune_and_freeze() to
-	 * use them to decide whether to freeze the page or not.  The
-	 * set_all_visible and set_all_frozen values returned to the caller are
-	 * adjusted to include LP_DEAD items after we determine whether to
-	 * opportunistically freeze.
-	 */
-	bool		set_all_visible;
-	bool		set_all_frozen;
 } PruneState;
 
 
@@ -232,6 +236,7 @@ static void page_verify_redirects(Page page);
 
 static bool heap_page_will_freeze(bool did_tuple_hint_fpi, bool do_prune, bool do_hint_prune,
 								  PruneState *prstate);
+static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason);
 
 
 /*
@@ -398,6 +403,7 @@ prune_freeze_setup(PruneFreezeParams *params,
 
 	Assert(BufferIsValid(params->vmbuffer));
 	prstate->vmbuffer = params->vmbuffer;
+	prstate->new_vmbits = 0;
 	prstate->old_vmbits = visibilitymap_get_status(prstate->relation,
 												   prstate->block,
 												   &prstate->vmbuffer);
@@ -915,6 +921,42 @@ heap_page_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum,
 	prstate->old_vmbits = 0;
 }
 
+/*
+ * Decide whether to set the visibility map bits (all-visible and all-frozen)
+ * for heap_blk using information from the PruneState and VM.
+ *
+ * This function does not actually set the VM bits or page-level visibility
+ * hint, PD_ALL_VISIBLE.
+ *
+ * Returns true if one or both VM bits should be set and false otherwise.
+ */
+static bool
+heap_page_will_set_vm(PruneState *prstate, PruneReason reason)
+{
+	/*
+	 * Though on-access pruning maintains prstate->set_all_visible, we don't
+	 * set the VM for now.
+	 */
+	if (reason == PRUNE_ON_ACCESS)
+		return false;
+
+	if (!prstate->set_all_visible)
+		return false;
+
+	prstate->new_vmbits = VISIBILITYMAP_ALL_VISIBLE;
+
+	if (prstate->set_all_frozen)
+		prstate->new_vmbits |= VISIBILITYMAP_ALL_FROZEN;
+
+	if (prstate->new_vmbits == prstate->old_vmbits)
+	{
+		prstate->new_vmbits = 0;
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * If the page is already all-frozen, or already all-visible and freezing
  * is not being attempted, there is no remaining work and we can bypass the
@@ -948,8 +990,6 @@ prune_freeze_bypass(PruneState *prstate, PruneFreezeResult *presult)
 	/* We'll fill in presult for the caller */
 	memset(presult, 0, sizeof(PruneFreezeResult));
 
-	presult->old_vmbits = prstate->old_vmbits;
-
 	/* Clear any stale prune hint */
 	if (TransactionIdIsValid(PageGetPruneXid(page)))
 	{
@@ -984,7 +1024,8 @@ prune_freeze_bypass(PruneState *prstate, PruneFreezeResult *presult)
 
 /*
  * Prune and repair fragmentation and potentially freeze tuples on the
- * specified page.
+ * specified page. If the page's visibility status has changed, update it in
+ * the VM.
  *
  * Caller must have pin and buffer cleanup lock on the page.  Note that we
  * don't update the FSM information for page on caller's behalf.  Caller might
@@ -999,12 +1040,10 @@ prune_freeze_bypass(PruneState *prstate, PruneFreezeResult *presult)
  * tuples if it's required in order to advance relfrozenxid / relminmxid, or
  * if it's considered advantageous for overall system performance to do so
  * now.  The 'params.cutoffs', 'presult', 'new_relfrozen_xid' and
- * 'new_relmin_mxid' arguments are required when freezing.  When
- * HEAP_PAGE_PRUNE_FREEZE option is passed, we also set
- * presult->set_all_visible and presult->set_all_frozen after determining
- * whether or not to opportunistically freeze, to indicate if the VM bits can
- * be set. 'all-frozen' is always set to false when the HEAP_PAGE_PRUNE_FREEZE
- * option is not passed.
+ * 'new_relmin_mxid' arguments are required when freezing.
+ *
+ * A vmbuffer corresponding to the heap page is also passed and if the page is
+ * found to be all-visible/all-frozen, we will set it in the VM.
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
@@ -1032,8 +1071,10 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 	bool		do_freeze;
 	bool		do_prune;
 	bool		do_hint_prune;
+	bool		do_set_vm;
 	bool		did_tuple_hint_fpi;
 	int64		fpi_before = pgWalUsage.wal_fpi;
+	TransactionId conflict_xid;
 
 	/* Initialize prstate */
 	prune_freeze_setup(params,
@@ -1125,6 +1166,31 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 		prstate.set_all_visible = prstate.set_all_frozen = false;
 
 	Assert(!prstate.set_all_frozen || prstate.set_all_visible);
+	Assert(!prstate.set_all_visible || (prstate.lpdead_items == 0));
+
+	do_set_vm = heap_page_will_set_vm(&prstate, params->reason);
+
+	/*
+	 * new_vmbits should be 0 regardless of whether or not the page is
+	 * all-visible if we do not intend to set the VM.
+	 */
+	Assert(do_set_vm || prstate.new_vmbits == 0);
+
+	/*
+	 * The snapshot conflict horizon for the whole record is the most
+	 * conservative (newest) horizon required by any change in the record.
+	 */
+	conflict_xid = InvalidTransactionId;
+	if (do_set_vm)
+		conflict_xid = prstate.newest_live_xid;
+	if (do_freeze && TransactionIdFollows(prstate.pagefrz.FreezePageConflictXid, conflict_xid))
+		conflict_xid = prstate.pagefrz.FreezePageConflictXid;
+	if (do_prune && TransactionIdFollows(prstate.latest_xid_removed, conflict_xid))
+		conflict_xid = prstate.latest_xid_removed;
+
+	/* Lock vmbuffer before entering a critical section */
+	if (do_set_vm)
+		LockBuffer(prstate.vmbuffer, BUFFER_LOCK_EXCLUSIVE);
 
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
@@ -1146,14 +1212,17 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 
 		/*
 		 * If that's all we had to do to the page, this is a non-WAL-logged
-		 * hint.  If we are going to freeze or prune the page, we will mark
-		 * the buffer dirty below.
+		 * hint.  If we are going to freeze or prune the page or set
+		 * PD_ALL_VISIBLE, we will mark the buffer dirty below.
+		 *
+		 * Setting PD_ALL_VISIBLE is fully WAL-logged because it is forbidden
+		 * for the VM to be set and PD_ALL_VISIBLE to be clear.
 		 */
-		if (!do_freeze && !do_prune)
+		if (!do_freeze && !do_prune && !do_set_vm)
 			MarkBufferDirtyHint(prstate.buffer, true);
 	}
 
-	if (do_prune || do_freeze)
+	if (do_prune || do_freeze || do_set_vm)
 	{
 		/* Apply the planned item changes and repair page fragmentation. */
 		if (do_prune)
@@ -1167,6 +1236,27 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 		if (do_freeze)
 			heap_freeze_prepared_tuples(prstate.buffer, prstate.frozen, prstate.nfrozen);
 
+		/* Set the visibility map and page visibility hint */
+		if (do_set_vm)
+		{
+			/*
+			 * While it is valid for PD_ALL_VISIBLE to be set when the
+			 * corresponding VM bit is clear, we strongly prefer to keep them
+			 * in sync.
+			 *
+			 * The heap buffer must be marked dirty before adding it to the
+			 * WAL chain when setting the VM. We don't worry about
+			 * unnecessarily dirtying the heap buffer if PD_ALL_VISIBLE is
+			 * already set, though. It is extremely rare to have a clean heap
+			 * buffer with PD_ALL_VISIBLE already set and the VM bits clear,
+			 * so there is no point in optimizing it.
+			 */
+			PageSetAllVisible(prstate.page);
+			PageClearPrunable(prstate.page);
+			visibilitymap_set_vmbits(prstate.block, prstate.vmbuffer, prstate.new_vmbits,
+									 prstate.relation->rd_locator);
+		}
+
 		MarkBufferDirty(prstate.buffer);
 
 		/*
@@ -1174,29 +1264,12 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 		 */
 		if (RelationNeedsWAL(prstate.relation))
 		{
-			/*
-			 * The snapshotConflictHorizon for the whole record should be the
-			 * most conservative of all the horizons calculated for any of the
-			 * possible modifications. If this record will prune tuples, any
-			 * queries on the standby older than the newest xid of the most
-			 * recently removed tuple this record will prune will conflict. If
-			 * this record will freeze tuples, any queries on the standby with
-			 * xids older than the newest tuple this record will freeze will
-			 * conflict.
-			 */
-			TransactionId conflict_xid;
-
-			if (TransactionIdFollows(prstate.pagefrz.FreezePageConflictXid,
-									 prstate.latest_xid_removed))
-				conflict_xid = prstate.pagefrz.FreezePageConflictXid;
-			else
-				conflict_xid = prstate.latest_xid_removed;
-
 			log_heap_prune_and_freeze(prstate.relation, prstate.buffer,
-									  InvalidBuffer,	/* vmbuffer */
-									  0,	/* vmflags */
+									  do_set_vm ? prstate.vmbuffer : InvalidBuffer,
+									  do_set_vm ? prstate.new_vmbits : 0,
 									  conflict_xid,
-									  true, params->reason,
+									  true, /* cleanup lock */
+									  params->reason,
 									  prstate.frozen, prstate.nfrozen,
 									  prstate.redirected, prstate.nredirected,
 									  prstate.nowdead, prstate.ndead,
@@ -1206,33 +1279,70 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 
 	END_CRIT_SECTION();
 
+	if (do_set_vm)
+		LockBuffer(prstate.vmbuffer, BUFFER_LOCK_UNLOCK);
+
+	/*
+	 * During its second pass over the heap, VACUUM calls
+	 * heap_page_would_be_all_visible() to determine whether a page is
+	 * all-visible and all-frozen. The logic here is similar. After completing
+	 * pruning and freezing, use an assertion to verify that our results
+	 * remain consistent with heap_page_would_be_all_visible().
+	 */
+#ifdef USE_ASSERT_CHECKING
+	if (prstate.set_all_visible)
+	{
+		TransactionId debug_cutoff;
+		bool		debug_all_frozen;
+
+		Assert(prstate.lpdead_items == 0);
+
+		Assert(heap_page_is_all_visible(prstate.relation, prstate.buffer,
+										prstate.vistest,
+										&debug_all_frozen,
+										&debug_cutoff, off_loc));
+
+		Assert(!TransactionIdIsValid(debug_cutoff) ||
+			   debug_cutoff == prstate.newest_live_xid);
+
+		/*
+		 * It's possible the page is composed entirely of frozen tuples but is
+		 * not set all-frozen in the VM and did not pass
+		 * HEAP_PAGE_PRUNE_FREEZE. In this case, it's possible
+		 * heap_page_is_all_visible() finds the page completely frozen, even
+		 * though prstate.set_all_frozen is false.
+		 */
+		Assert(!prstate.set_all_frozen || debug_all_frozen);
+	}
+#endif
+
 	/* Copy information back for caller */
 	presult->ndeleted = prstate.ndeleted;
 	presult->nnewlpdead = prstate.ndead;
 	presult->nfrozen = prstate.nfrozen;
 	presult->live_tuples = prstate.live_tuples;
 	presult->recently_dead_tuples = prstate.recently_dead_tuples;
-	presult->set_all_visible = prstate.set_all_visible;
-	presult->set_all_frozen = prstate.set_all_frozen;
 	presult->hastup = prstate.hastup;
-	presult->old_vmbits = prstate.old_vmbits;
-
-	/*
-	 * For callers planning to update the visibility map, the conflict horizon
-	 * for that record must be the newest xmin on the page.  However, if the
-	 * page is completely frozen, there can be no conflict and the
-	 * vm_conflict_horizon should remain InvalidTransactionId.  This includes
-	 * the case that we just froze all the tuples; the prune-freeze record
-	 * included the conflict XID already so the caller doesn't need it.
-	 */
-	if (presult->set_all_frozen)
-		presult->vm_conflict_horizon = InvalidTransactionId;
-	else
-		presult->vm_conflict_horizon = prstate.newest_live_xid;
 
 	presult->lpdead_items = prstate.lpdead_items;
 	/* the presult->deadoffsets array was already filled in */
 
+	presult->newly_all_visible = false;
+	presult->newly_all_frozen = false;
+	presult->newly_all_visible_frozen = false;
+	if (do_set_vm)
+	{
+		if ((prstate.old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+		{
+			presult->newly_all_visible = true;
+			if (prstate.set_all_frozen)
+				presult->newly_all_visible_frozen = true;
+		}
+		else if ((prstate.old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
+				 prstate.set_all_frozen)
+			presult->newly_all_frozen = true;
+	}
+
 	if (prstate.attempt_freeze)
 	{
 		if (presult->nfrozen > 0)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8599dd7fcfa..d144e0f642b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -466,13 +466,6 @@ static void dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *
 static void dead_items_reset(LVRelState *vacrel);
 static void dead_items_cleanup(LVRelState *vacrel);
 
-#ifdef USE_ASSERT_CHECKING
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
-									 GlobalVisState *vistest,
-									 bool *all_frozen,
-									 TransactionId *newest_live_xid,
-									 OffsetNumber *logging_offnum);
-#endif
 static bool heap_page_would_be_all_visible(Relation rel, Buffer buf,
 										   GlobalVisState *vistest,
 										   OffsetNumber *deadoffsets,
@@ -2021,8 +2014,6 @@ lazy_scan_prune(LVRelState *vacrel,
 		.vistest = vacrel->vistest,
 		.cutoffs = &vacrel->cutoffs,
 	};
-	uint8		old_vmbits = 0;
-	uint8		new_vmbits = 0;
 
 	Assert(BufferGetBlockNumber(buf) == blkno);
 
@@ -2073,32 +2064,6 @@ lazy_scan_prune(LVRelState *vacrel,
 		vacrel->new_frozen_tuple_pages++;
 	}
 
-	/*
-	 * VACUUM will call heap_page_is_all_visible() during the second pass over
-	 * the heap to determine all_visible and all_frozen for the page -- this
-	 * is a specialized version of the logic from this function.  Now that
-	 * we've finished pruning and freezing, make sure that we're in total
-	 * agreement with heap_page_is_all_visible() using an assertion.
-	 */
-#ifdef USE_ASSERT_CHECKING
-	if (presult.set_all_visible)
-	{
-		TransactionId debug_cutoff;
-		bool		debug_all_frozen;
-
-		Assert(presult.lpdead_items == 0);
-
-		Assert(heap_page_is_all_visible(vacrel->rel, buf,
-										vacrel->vistest, &debug_all_frozen,
-										&debug_cutoff, &vacrel->offnum));
-
-		Assert(presult.set_all_frozen == debug_all_frozen);
-
-		Assert(!TransactionIdIsValid(debug_cutoff) ||
-			   debug_cutoff == presult.vm_conflict_horizon);
-	}
-#endif
-
 	/*
 	 * Now save details of the LP_DEAD items from the page in vacrel
 	 */
@@ -2119,6 +2084,17 @@ lazy_scan_prune(LVRelState *vacrel,
 	}
 
 	/* Finally, add page-local counts to whole-VACUUM counts */
+	if (presult.newly_all_visible)
+		vacrel->new_all_visible_pages++;
+	if (presult.newly_all_visible_frozen)
+		vacrel->new_all_visible_all_frozen_pages++;
+	if (presult.newly_all_frozen)
+		vacrel->new_all_frozen_pages++;
+
+	/* Capture if the page was newly set frozen */
+	*vm_page_frozen = presult.newly_all_visible_frozen ||
+		presult.newly_all_frozen;
+
 	vacrel->tuples_deleted += presult.ndeleted;
 	vacrel->tuples_frozen += presult.nfrozen;
 	vacrel->lpdead_items += presult.lpdead_items;
@@ -2132,71 +2108,6 @@ lazy_scan_prune(LVRelState *vacrel,
 	/* Did we find LP_DEAD items? */
 	*has_lpdead_items = (presult.lpdead_items > 0);
 
-	Assert(!presult.set_all_visible || !(*has_lpdead_items));
-	Assert(!presult.set_all_frozen || presult.set_all_visible);
-
-	if (!presult.set_all_visible)
-		return presult.ndeleted;
-
-	/* Set the visibility map and page visibility hint */
-	old_vmbits = presult.old_vmbits;
-	new_vmbits = VISIBILITYMAP_ALL_VISIBLE;
-	if (presult.set_all_frozen)
-		new_vmbits |= VISIBILITYMAP_ALL_FROZEN;
-
-	/* Nothing to do */
-	if (old_vmbits == new_vmbits)
-		return presult.ndeleted;
-
-	/*
-	 * It should never be the case that the visibility map page is set while
-	 * the page-level bit is clear (and if so, we cleared it above), but the
-	 * reverse is allowed (if checksums are not enabled). Regardless, set both
-	 * bits so that we get back in sync.
-	 *
-	 * The heap buffer must be marked dirty before adding it to the WAL chain
-	 * when setting the VM. We don't worry about unnecessarily dirtying the
-	 * heap buffer if PD_ALL_VISIBLE is already set, though. It is extremely
-	 * rare to have a clean heap buffer with PD_ALL_VISIBLE already set and
-	 * the VM bits clear, so there is no point in optimizing it.
-	 */
-	PageSetAllVisible(page);
-	PageClearPrunable(page);
-	MarkBufferDirty(buf);
-
-	/*
-	 * If the page is being set all-frozen, we pass InvalidTransactionId as
-	 * the cutoff_xid, since a snapshot conflict horizon sufficient to make
-	 * everything safe for REDO was logged when the page's tuples were frozen.
-	 */
-	Assert(!presult.set_all_frozen ||
-		   !TransactionIdIsValid(presult.vm_conflict_horizon));
-
-	visibilitymap_set(vacrel->rel, blkno, buf,
-					  InvalidXLogRecPtr,
-					  vmbuffer, presult.vm_conflict_horizon,
-					  new_vmbits);
-
-	/*
-	 * If the page wasn't already set all-visible and/or all-frozen in the VM,
-	 * count it as newly set for logging.
-	 */
-	if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
-	{
-		vacrel->new_all_visible_pages++;
-		if (presult.set_all_frozen)
-		{
-			vacrel->new_all_visible_all_frozen_pages++;
-			*vm_page_frozen = true;
-		}
-	}
-	else if ((old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
-			 presult.set_all_frozen)
-	{
-		vacrel->new_all_frozen_pages++;
-		*vm_page_frozen = true;
-	}
-
 	return presult.ndeleted;
 }
 
@@ -3612,7 +3523,7 @@ dead_items_cleanup(LVRelState *vacrel)
  * that expect no LP_DEAD on the page. Currently assert-only, but there is no
  * reason not to use it outside of asserts.
  */
-static bool
+bool
 heap_page_is_all_visible(Relation rel, Buffer buf,
 						 GlobalVisState *vistest,
 						 bool *all_frozen,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f9dbd70c1c4..b9577c24844 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -265,7 +265,8 @@ typedef struct PruneFreezeParams
 
 	/*
 	 * Callers should provide a pinned vmbuffer corresponding to the heap
-	 * block in buffer. We will check for and repair any corruption in the VM.
+	 * block in buffer. We will check for and repair any corruption in the VM
+	 * and set the VM after pruning if the page is all-visible/all-frozen.
 	 */
 	Buffer		vmbuffer;
 
@@ -281,8 +282,7 @@ typedef struct PruneFreezeParams
 	 * HEAP_PAGE_PRUNE_MARK_UNUSED_NOW indicates that dead items can be set
 	 * LP_UNUSED during pruning.
 	 *
-	 * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples, and
-	 * will return 'all_visible', 'all_frozen' flags to the caller.
+	 * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples.
 	 */
 	int			options;
 
@@ -316,26 +316,12 @@ typedef struct PruneFreezeResult
 	int			recently_dead_tuples;
 
 	/*
-	 * set_all_visible and set_all_frozen indicate if the all-visible and
-	 * all-frozen bits in the visibility map should be set for this page after
-	 * pruning.
-	 *
-	 * vm_conflict_horizon is the newest xmin of live tuples on the page.  The
-	 * caller can use it as the conflict horizon when setting the VM bits.  It
-	 * is only valid if we froze some tuples (nfrozen > 0), and set_all_frozen
-	 * is true.
-	 *
-	 * These are only set if the HEAP_PAGE_PRUNE_FREEZE option is set.
-	 */
-	bool		set_all_visible;
-	bool		set_all_frozen;
-	TransactionId vm_conflict_horizon;
-
-	/*
-	 * The value of the vmbuffer's vmbits at the beginning of pruning. It is
-	 * cleared if VM corruption is found and corrected.
+	 * Whether or not the page was newly set all-visible and all-frozen during
+	 * phase I of vacuuming.
 	 */
-	uint8		old_vmbits;
+	bool		newly_all_visible;
+	bool		newly_all_visible_frozen;
+	bool		newly_all_frozen;
 
 	/*
 	 * Whether or not the page makes rel truncation unsafe.  This is set to
@@ -472,6 +458,13 @@ extern void log_heap_prune_and_freeze(Relation relation, Buffer buffer,
 /* in heap/vacuumlazy.c */
 extern void heap_vacuum_rel(Relation rel,
 							const VacuumParams params, BufferAccessStrategy bstrategy);
+#ifdef USE_ASSERT_CHECKING
+extern bool heap_page_is_all_visible(Relation rel, Buffer buf,
+									 GlobalVisState *vistest,
+									 bool *all_frozen,
+									 TransactionId *visibility_cutoff_xid,
+									 OffsetNumber *logging_offnum);
+#endif
 
 /* in heap/heapam_visibility.c */
 extern bool HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot,
-- 
2.43.0



  [text/x-patch] v41-0006-WAL-log-VM-setting-for-empty-pages-in-XLOG_HEAP2.patch (2.7K, 7-v41-0006-WAL-log-VM-setting-for-empty-pages-in-XLOG_HEAP2.patch)
  download | inline diff:
From 561637633c1417af7dea0509bbbf55dca3c2fead Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Sat, 27 Sep 2025 11:55:21 -0400
Subject: [PATCH v41 06/12] WAL log VM setting for empty pages in
 XLOG_HEAP2_PRUNE_VACUUM_SCAN

As part of removing XLOG_HEAP2_VISIBLE records, phase I of VACUUM now
marks empty pages all-visible in a XLOG_HEAP2_PRUNE_VACUUM_SCAN record.

This has no independent benefit, but empty pages were the last user of
XLOG_HEAP2_VISIBLE, so if we make this change we can remove all of the
XLOH_HEAP2_VISIBLE code.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Robert Haas <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
---
 src/backend/access/heap/vacuumlazy.c | 35 +++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d144e0f642b..de93bff4a8e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1928,9 +1928,12 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
 		 */
 		if (!PageIsAllVisible(page))
 		{
+			/* Lock vmbuffer before entering critical section */
+			LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
+
 			START_CRIT_SECTION();
 
-			/* mark buffer dirty before writing a WAL record */
+			/* Mark buffer dirty before writing any WAL records */
 			MarkBufferDirty(buf);
 
 			/*
@@ -1948,13 +1951,33 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
 
 			PageSetAllVisible(page);
 			PageClearPrunable(page);
-			visibilitymap_set(vacrel->rel, blkno, buf,
-							  InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_VISIBLE |
-							  VISIBILITYMAP_ALL_FROZEN);
+			visibilitymap_set_vmbits(blkno,
+									 vmbuffer,
+									 VISIBILITYMAP_ALL_VISIBLE |
+									 VISIBILITYMAP_ALL_FROZEN,
+									 vacrel->rel->rd_locator);
+
+			/*
+			 * Emit WAL for setting PD_ALL_VISIBLE on the heap page and
+			 * setting the VM.
+			 */
+			if (RelationNeedsWAL(vacrel->rel))
+				log_heap_prune_and_freeze(vacrel->rel, buf,
+										  vmbuffer,
+										  VISIBILITYMAP_ALL_VISIBLE |
+										  VISIBILITYMAP_ALL_FROZEN,
+										  InvalidTransactionId, /* conflict xid */
+										  false,	/* cleanup lock */
+										  PRUNE_VACUUM_SCAN,	/* reason */
+										  NULL, 0,
+										  NULL, 0,
+										  NULL, 0,
+										  NULL, 0);
+
 			END_CRIT_SECTION();
 
+			LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
+
 			/* Count the newly all-frozen pages for logging */
 			vacrel->new_all_visible_pages++;
 			vacrel->new_all_visible_all_frozen_pages++;
-- 
2.43.0



  [text/x-patch] v41-0007-Remove-XLOG_HEAP2_VISIBLE-entirely.patch (27.5K, 8-v41-0007-Remove-XLOG_HEAP2_VISIBLE-entirely.patch)
  download | inline diff:
From 8c63ade694ce8ed5bcb2d67c15203f0bd41d6b3f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Sat, 27 Sep 2025 11:55:36 -0400
Subject: [PATCH v41 07/12] Remove XLOG_HEAP2_VISIBLE entirely

There are no remaining users that emit XLOG_HEAP2_VISIBLE records, so it
can be removed. This includes deleting the xl_heap_visible struct and
all functions responsible for emitting or replaying XLOG_HEAP2_VISIBLE
records.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andrey Borodin <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
---
 src/backend/access/common/bufmask.c      |   5 +-
 src/backend/access/heap/heapam.c         |  54 +-------
 src/backend/access/heap/heapam_xlog.c    | 156 ++---------------------
 src/backend/access/heap/pruneheap.c      |   4 +-
 src/backend/access/heap/vacuumlazy.c     |  16 +--
 src/backend/access/heap/visibilitymap.c  | 150 +++++-----------------
 src/backend/access/rmgrdesc/heapdesc.c   |  10 --
 src/backend/replication/logical/decode.c |   1 -
 src/backend/storage/ipc/standby.c        |   9 +-
 src/include/access/heapam_xlog.h         |  21 +--
 src/include/access/visibilitymap.h       |  13 +-
 src/include/access/visibilitymapdefs.h   |   9 --
 src/tools/pgindent/typedefs.list         |   1 -
 13 files changed, 63 insertions(+), 386 deletions(-)

diff --git a/src/backend/access/common/bufmask.c b/src/backend/access/common/bufmask.c
index 8a67bfa1aff..d64c403f2f0 100644
--- a/src/backend/access/common/bufmask.c
+++ b/src/backend/access/common/bufmask.c
@@ -55,9 +55,8 @@ mask_page_hint_bits(Page page)
 	PageClearHasFreeLinePointers(page);
 
 	/*
-	 * During replay, if the page LSN has advanced past our XLOG record's LSN,
-	 * we don't mark the page all-visible. See heap_xlog_visible() for
-	 * details.
+	 * XXX: We should consider not masking PD_ALL_VISIBLE during WAL
+	 * consistency checking.
 	 */
 	PageClearAllVisible(page);
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e5bd062de77..044f385e477 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2589,11 +2589,11 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		{
 			PageSetAllVisible(page);
 			PageClearPrunable(page);
-			visibilitymap_set_vmbits(BufferGetBlockNumber(buffer),
-									 vmbuffer,
-									 VISIBILITYMAP_ALL_VISIBLE |
-									 VISIBILITYMAP_ALL_FROZEN,
-									 relation->rd_locator);
+			visibilitymap_set(BufferGetBlockNumber(buffer),
+							  vmbuffer,
+							  VISIBILITYMAP_ALL_VISIBLE |
+							  VISIBILITYMAP_ALL_FROZEN,
+							  relation->rd_locator);
 		}
 
 		/*
@@ -8886,50 +8886,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
 	return nblocksfavorable;
 }
 
-/*
- * Perform XLogInsert for a heap-visible operation.  'block' is the block
- * being marked all-visible, and vm_buffer is the buffer containing the
- * corresponding visibility map block.  Both should have already been modified
- * and dirtied.
- *
- * snapshotConflictHorizon comes from the largest xmin on the page being
- * marked all-visible.  REDO routine uses it to generate recovery conflicts.
- *
- * If checksums or wal_log_hints are enabled, we may also generate a full-page
- * image of heap_buffer. Otherwise, we optimize away the FPI (by specifying
- * REGBUF_NO_IMAGE for the heap buffer), in which case the caller should *not*
- * update the heap page's LSN.
- */
-XLogRecPtr
-log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
-				 TransactionId snapshotConflictHorizon, uint8 vmflags)
-{
-	xl_heap_visible xlrec;
-	XLogRecPtr	recptr;
-	uint8		flags;
-
-	Assert(BufferIsValid(heap_buffer));
-	Assert(BufferIsValid(vm_buffer));
-
-	xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
-	xlrec.flags = vmflags;
-	if (RelationIsAccessibleInLogicalDecoding(rel))
-		xlrec.flags |= VISIBILITYMAP_XLOG_CATALOG_REL;
-	XLogBeginInsert();
-	XLogRegisterData(&xlrec, SizeOfHeapVisible);
-
-	XLogRegisterBuffer(0, vm_buffer, 0);
-
-	flags = REGBUF_STANDARD;
-	if (!XLogHintBitIsNeeded())
-		flags |= REGBUF_NO_IMAGE;
-	XLogRegisterBuffer(1, heap_buffer, flags);
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VISIBLE);
-
-	return recptr;
-}
-
 /*
  * Perform XLogInsert for a heap-update operation.  Caller must already
  * have modified the buffer(s) and marked them dirty.
diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
index 1da774c1536..1302bb13e18 100644
--- a/src/backend/access/heap/heapam_xlog.c
+++ b/src/backend/access/heap/heapam_xlog.c
@@ -239,7 +239,7 @@ heap_xlog_prune_freeze(XLogReaderState *record)
 		if (PageIsNew(vmpage))
 			PageInit(vmpage, BLCKSZ, 0);
 
-		visibilitymap_set_vmbits(blkno, vmbuffer, vmflags, rlocator);
+		visibilitymap_set(blkno, vmbuffer, vmflags, rlocator);
 
 		Assert(BufferIsDirty(vmbuffer));
 		PageSetLSN(vmpage, lsn);
@@ -252,143 +252,6 @@ heap_xlog_prune_freeze(XLogReaderState *record)
 		XLogRecordPageWithFreeSpace(rlocator, blkno, freespace);
 }
 
-/*
- * Replay XLOG_HEAP2_VISIBLE records.
- *
- * The critical integrity requirement here is that we must never end up with
- * a situation where the visibility map bit is set, and the page-level
- * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
- * page modification would fail to clear the visibility map bit.
- */
-static void
-heap_xlog_visible(XLogReaderState *record)
-{
-	XLogRecPtr	lsn = record->EndRecPtr;
-	xl_heap_visible *xlrec = (xl_heap_visible *) XLogRecGetData(record);
-	Buffer		vmbuffer = InvalidBuffer;
-	Buffer		buffer;
-	Page		page;
-	RelFileLocator rlocator;
-	BlockNumber blkno;
-	XLogRedoAction action;
-
-	Assert((xlrec->flags & VISIBILITYMAP_XLOG_VALID_BITS) == xlrec->flags);
-
-	XLogRecGetBlockTag(record, 1, &rlocator, NULL, &blkno);
-
-	/*
-	 * If there are any Hot Standby transactions running that have an xmin
-	 * horizon old enough that this page isn't all-visible for them, they
-	 * might incorrectly decide that an index-only scan can skip a heap fetch.
-	 *
-	 * NB: It might be better to throw some kind of "soft" conflict here that
-	 * forces any index-only scan that is in flight to perform heap fetches,
-	 * rather than killing the transaction outright.
-	 */
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon,
-											xlrec->flags & VISIBILITYMAP_XLOG_CATALOG_REL,
-											rlocator);
-
-	/*
-	 * Read the heap page, if it still exists. If the heap file has dropped or
-	 * truncated later in recovery, we don't need to update the page, but we'd
-	 * better still update the visibility map.
-	 */
-	action = XLogReadBufferForRedo(record, 1, &buffer);
-	if (action == BLK_NEEDS_REDO)
-	{
-		/*
-		 * We don't bump the LSN of the heap page when setting the visibility
-		 * map bit (unless checksums or wal_hint_bits is enabled, in which
-		 * case we must). This exposes us to torn page hazards, but since
-		 * we're not inspecting the existing page contents in any way, we
-		 * don't care.
-		 */
-		page = BufferGetPage(buffer);
-
-		PageSetAllVisible(page);
-		PageClearPrunable(page);
-
-		if (XLogHintBitIsNeeded())
-			PageSetLSN(page, lsn);
-
-		MarkBufferDirty(buffer);
-	}
-	else if (action == BLK_RESTORED)
-	{
-		/*
-		 * If heap block was backed up, we already restored it and there's
-		 * nothing more to do. (This can only happen with checksums or
-		 * wal_log_hints enabled.)
-		 */
-	}
-
-	if (BufferIsValid(buffer))
-	{
-		Size		space = PageGetFreeSpace(BufferGetPage(buffer));
-
-		UnlockReleaseBuffer(buffer);
-
-		/*
-		 * Since FSM is not WAL-logged and only updated heuristically, it
-		 * easily becomes stale in standbys.  If the standby is later promoted
-		 * and runs VACUUM, it will skip updating individual free space
-		 * figures for pages that became all-visible (or all-frozen, depending
-		 * on the vacuum mode,) which is troublesome when FreeSpaceMapVacuum
-		 * propagates too optimistic free space values to upper FSM layers;
-		 * later inserters try to use such pages only to find out that they
-		 * are unusable.  This can cause long stalls when there are many such
-		 * pages.
-		 *
-		 * Forestall those problems by updating FSM's idea about a page that
-		 * is becoming all-visible or all-frozen.
-		 *
-		 * Do this regardless of a full-page image being applied, since the
-		 * FSM data is not in the page anyway.
-		 */
-		if (xlrec->flags & VISIBILITYMAP_VALID_BITS)
-			XLogRecordPageWithFreeSpace(rlocator, blkno, space);
-	}
-
-	/*
-	 * Even if we skipped the heap page update due to the LSN interlock, it's
-	 * still safe to update the visibility map.  Any WAL record that clears
-	 * the visibility map bit does so before checking the page LSN, so any
-	 * bits that need to be cleared will still be cleared.
-	 */
-	if (XLogReadBufferForRedoExtended(record, 0, RBM_ZERO_ON_ERROR, false,
-									  &vmbuffer) == BLK_NEEDS_REDO)
-	{
-		Page		vmpage = BufferGetPage(vmbuffer);
-		Relation	reln;
-		uint8		vmbits;
-
-		/* initialize the page if it was read as zeros */
-		if (PageIsNew(vmpage))
-			PageInit(vmpage, BLCKSZ, 0);
-
-		/* remove VISIBILITYMAP_XLOG_* */
-		vmbits = xlrec->flags & VISIBILITYMAP_VALID_BITS;
-
-		/*
-		 * XLogReadBufferForRedoExtended locked the buffer. But
-		 * visibilitymap_set will handle locking itself.
-		 */
-		LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
-
-		reln = CreateFakeRelcacheEntry(rlocator);
-
-		visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
-						  xlrec->snapshotConflictHorizon, vmbits);
-
-		ReleaseBuffer(vmbuffer);
-		FreeFakeRelcacheEntry(reln);
-	}
-	else if (BufferIsValid(vmbuffer))
-		UnlockReleaseBuffer(vmbuffer);
-}
-
 /*
  * Given an "infobits" field from an XLog record, set the correct bits in the
  * given infomask and infomask2 for the tuple touched by the record.
@@ -769,8 +632,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
 	 *
 	 * During recovery, however, no concurrent writers exist. Therefore,
 	 * updating the VM without holding the heap page lock is safe enough. This
-	 * same approach is taken when replaying xl_heap_visible records (see
-	 * heap_xlog_visible()).
+	 * same approach is taken when replaying XLOG_HEAP2_PRUNE* records (see
+	 * heap_xlog_prune_freeze()).
 	 */
 	if ((xlrec->flags & XLH_INSERT_ALL_FROZEN_SET) &&
 		XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
@@ -782,11 +645,11 @@ heap_xlog_multi_insert(XLogReaderState *record)
 		if (PageIsNew(vmpage))
 			PageInit(vmpage, BLCKSZ, 0);
 
-		visibilitymap_set_vmbits(blkno,
-								 vmbuffer,
-								 VISIBILITYMAP_ALL_VISIBLE |
-								 VISIBILITYMAP_ALL_FROZEN,
-								 rlocator);
+		visibilitymap_set(blkno,
+						  vmbuffer,
+						  VISIBILITYMAP_ALL_VISIBLE |
+						  VISIBILITYMAP_ALL_FROZEN,
+						  rlocator);
 
 		Assert(BufferIsDirty(vmbuffer));
 		PageSetLSN(vmpage, lsn);
@@ -1369,9 +1232,6 @@ heap2_redo(XLogReaderState *record)
 		case XLOG_HEAP2_PRUNE_VACUUM_CLEANUP:
 			heap_xlog_prune_freeze(record);
 			break;
-		case XLOG_HEAP2_VISIBLE:
-			heap_xlog_visible(record);
-			break;
 		case XLOG_HEAP2_MULTI_INSERT:
 			heap_xlog_multi_insert(record);
 			break;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index dd8ac173ca1..fac7194dcba 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -1253,8 +1253,8 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 			 */
 			PageSetAllVisible(prstate.page);
 			PageClearPrunable(prstate.page);
-			visibilitymap_set_vmbits(prstate.block, prstate.vmbuffer, prstate.new_vmbits,
-									 prstate.relation->rd_locator);
+			visibilitymap_set(prstate.block, prstate.vmbuffer, prstate.new_vmbits,
+							  prstate.relation->rd_locator);
 		}
 
 		MarkBufferDirty(prstate.buffer);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index de93bff4a8e..461fdf4ed83 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1951,11 +1951,11 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
 
 			PageSetAllVisible(page);
 			PageClearPrunable(page);
-			visibilitymap_set_vmbits(blkno,
-									 vmbuffer,
-									 VISIBILITYMAP_ALL_VISIBLE |
-									 VISIBILITYMAP_ALL_FROZEN,
-									 vacrel->rel->rd_locator);
+			visibilitymap_set(blkno,
+							  vmbuffer,
+							  VISIBILITYMAP_ALL_VISIBLE |
+							  VISIBILITYMAP_ALL_FROZEN,
+							  vacrel->rel->rd_locator);
 
 			/*
 			 * Emit WAL for setting PD_ALL_VISIBLE on the heap page and
@@ -2833,9 +2833,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 		 */
 		PageSetAllVisible(page);
 		PageClearPrunable(page);
-		visibilitymap_set_vmbits(blkno,
-								 vmbuffer, vmflags,
-								 vacrel->rel->rd_locator);
+		visibilitymap_set(blkno,
+						  vmbuffer, vmflags,
+						  vacrel->rel->rd_locator);
 		conflict_xid = newest_live_xid;
 	}
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e21b96281a6..21e89c38f0a 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -14,8 +14,7 @@
  *		visibilitymap_clear  - clear bits for one page in the visibility map
  *		visibilitymap_pin	 - pin a map page for setting a bit
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
- *		visibilitymap_set	 - set bit(s) in a previously pinned page and log
- *		visibilitymap_set_vmbits - set bit(s) in a pinned page
+ *		visibilitymap_set	 - set bit(s) in a previously pinned page
  *		visibilitymap_get_status - get status of bits
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
@@ -35,21 +34,31 @@
  * is set, we know the condition is true, but if a bit is not set, it might or
  * might not be true.
  *
- * Clearing visibility map bits is not separately WAL-logged.  The callers
- * must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
- *
- * When we *set* a visibility map during VACUUM, we must write WAL.  This may
- * seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain.  The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit.  If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page.  Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * Changes to the visibility map bits are not separately WAL-logged. Callers
+ * must make sure that whenever a visibility map bit is cleared, the bit is
+ * cleared on WAL replay of the updating operation. And whenever a visibility
+ * map bit is set, the bit is set on WAL replay of the operation that rendered
+ * the page all-visible/all-frozen.
+ *
+ * The visibility map bits operate as a hint in one direction: if they are
+ * clear, it may still be the case that every tuple on the page is visible to
+ * all transactions (we just don't know that for certain). However, if they
+ * are set, we may skip vacuuming pages and incorrectly advance relfrozenxid
+ * or skip reading heap pages for an index-only scan and return wrong results.
+ *
+ * Additionally, it is critical that the heap-page level PD_ALL_VISIBLE bit be
+ * correctly set and cleared along with the VM bits.
+ *
+ * When clearing the VM, if a crash occurs after the heap page makes it to
+ * disk but before the VM page makes it to disk, replay must clear the VM or
+ * the next index-only scan can return wrong results or vacuum may incorrectly
+ * advance relfrozenxid.
+ *
+ * When setting the VM, if a crash occurs after the visibility map page makes
+ * it to disk and before the updated heap page makes it to disk, redo must set
+ * the bit on the heap page. Otherwise, the next insert, update, or delete on
+ * the heap page will fail to realize that the visibility map bit must be
+ * cleared, possibly causing index-only scans to return wrong answers.
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
@@ -222,112 +231,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf)
 	return BufferIsValid(vmbuf) && BufferGetBlockNumber(vmbuf) == mapBlock;
 }
 
-/*
- *	visibilitymap_set - set bit(s) on a previously pinned page
- *
- * recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
- * or InvalidXLogRecPtr in normal running.  The VM page LSN is advanced to the
- * one provided; in normal running, we generate a new XLOG record and set the
- * page LSN to that value (though the heap page's LSN may *not* be updated;
- * see below).  cutoff_xid is the largest xmin on the page being marked
- * all-visible; it is needed for Hot Standby, and can be InvalidTransactionId
- * if the page contains no tuples.  It can also be set to InvalidTransactionId
- * when a page that is already all-visible is being marked all-frozen.
- *
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
- *
- * You must pass a buffer containing the correct map page to this function.
- * Call visibilitymap_pin first to pin the right one. This function doesn't do
- * any I/O.
- */
-void
-visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
-				  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
-				  uint8 flags)
-{
-	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
-	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
-	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
-	Page		page;
-	uint8	   *map;
-	uint8		status;
-
-#ifdef TRACE_VISIBILITYMAP
-	elog(DEBUG1, "vm_set flags 0x%02X for %s %d",
-		 flags, RelationGetRelationName(rel), heapBlk);
-#endif
-
-	Assert(InRecovery || !XLogRecPtrIsValid(recptr));
-	Assert(InRecovery || PageIsAllVisible(BufferGetPage(heapBuf)));
-	Assert((flags & VISIBILITYMAP_VALID_BITS) == flags);
-
-	/* Must never set all_frozen bit without also setting all_visible bit */
-	Assert(flags != VISIBILITYMAP_ALL_FROZEN);
-
-	/* Check that we have the right heap page pinned, if present */
-	if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
-		elog(ERROR, "wrong heap buffer passed to visibilitymap_set");
-
-	Assert(!BufferIsValid(heapBuf) ||
-		   BufferIsLockedByMeInMode(heapBuf, BUFFER_LOCK_EXCLUSIVE));
-
-	/* Check that we have the right VM page pinned */
-	if (!BufferIsValid(vmBuf) || BufferGetBlockNumber(vmBuf) != mapBlock)
-		elog(ERROR, "wrong VM buffer passed to visibilitymap_set");
-
-	page = BufferGetPage(vmBuf);
-	map = (uint8 *) PageGetContents(page);
-	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
-
-	status = (map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS;
-	if (flags != status)
-	{
-		START_CRIT_SECTION();
-
-		map[mapByte] |= (flags << mapOffset);
-		MarkBufferDirty(vmBuf);
-
-		if (RelationNeedsWAL(rel))
-		{
-			if (!XLogRecPtrIsValid(recptr))
-			{
-				Assert(!InRecovery);
-				recptr = log_heap_visible(rel, heapBuf, vmBuf, cutoff_xid, flags);
-
-				/*
-				 * If data checksums are enabled (or wal_log_hints=on), we
-				 * need to protect the heap page from being torn.
-				 *
-				 * If not, then we must *not* update the heap page's LSN. In
-				 * this case, the FPI for the heap page was omitted from the
-				 * WAL record inserted above, so it would be incorrect to
-				 * update the heap page's LSN.
-				 */
-				if (XLogHintBitIsNeeded())
-				{
-					Page		heapPage = BufferGetPage(heapBuf);
-
-					PageSetLSN(heapPage, recptr);
-				}
-			}
-			PageSetLSN(page, recptr);
-		}
-
-		END_CRIT_SECTION();
-	}
-
-	LockBuffer(vmBuf, BUFFER_LOCK_UNLOCK);
-}
-
 /*
  * Set VM (visibility map) flags in the VM block in vmBuf.
  *
  * This function is intended for callers that log VM changes together
  * with the heap page modifications that rendered the page all-visible.
- * Callers that log VM changes separately should use visibilitymap_set().
  *
  * vmBuf must be pinned and exclusively locked, and it must cover the VM bits
  * corresponding to heapBlk.
@@ -343,9 +251,9 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
  * rlocator is used only for debugging messages.
  */
 void
-visibilitymap_set_vmbits(BlockNumber heapBlk,
-						 Buffer vmBuf, uint8 flags,
-						 const RelFileLocator rlocator)
+visibilitymap_set(BlockNumber heapBlk,
+				  Buffer vmBuf, uint8 flags,
+				  const RelFileLocator rlocator)
 {
 	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
 	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 02ae91653c1..75ae6f9d375 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -349,13 +349,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 			}
 		}
 	}
-	else if (info == XLOG_HEAP2_VISIBLE)
-	{
-		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
-
-		appendStringInfo(buf, "snapshotConflictHorizon: %u, flags: 0x%02X",
-						 xlrec->snapshotConflictHorizon, xlrec->flags);
-	}
 	else if (info == XLOG_HEAP2_MULTI_INSERT)
 	{
 		xl_heap_multi_insert *xlrec = (xl_heap_multi_insert *) rec;
@@ -461,9 +454,6 @@ heap2_identify(uint8 info)
 		case XLOG_HEAP2_PRUNE_VACUUM_CLEANUP:
 			id = "PRUNE_VACUUM_CLEANUP";
 			break;
-		case XLOG_HEAP2_VISIBLE:
-			id = "VISIBLE";
-			break;
 		case XLOG_HEAP2_MULTI_INSERT:
 			id = "MULTI_INSERT";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 21f03864a66..3c027bcb2f7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -448,7 +448,6 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_HEAP2_PRUNE_ON_ACCESS:
 		case XLOG_HEAP2_PRUNE_VACUUM_SCAN:
 		case XLOG_HEAP2_PRUNE_VACUUM_CLEANUP:
-		case XLOG_HEAP2_VISIBLE:
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;
 		default:
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index f3ad90c7c7a..de9092fdf5b 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -476,10 +476,11 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
 	/*
 	 * If we get passed InvalidTransactionId then we do nothing (no conflict).
 	 *
-	 * This can happen when replaying already-applied WAL records after a
-	 * standby crash or restart, or when replaying an XLOG_HEAP2_VISIBLE
-	 * record that marks as frozen a page which was already all-visible.  It's
-	 * also quite common with records generated during index deletion
+	 * This can happen whenever the changes in the WAL record do not affect
+	 * visibility on a standby. For example: a record that only freezes an
+	 * xmax from a locker.
+	 *
+	 * It's also quite common with records generated during index deletion
 	 * (original execution of the deletion can reason that a recovery conflict
 	 * which is sufficient for the deletion operation must take place before
 	 * replay of the deletion record itself).
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index ce3566ba949..516806fcca2 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -60,7 +60,7 @@
 #define XLOG_HEAP2_PRUNE_ON_ACCESS		0x10
 #define XLOG_HEAP2_PRUNE_VACUUM_SCAN	0x20
 #define XLOG_HEAP2_PRUNE_VACUUM_CLEANUP	0x30
-#define XLOG_HEAP2_VISIBLE		0x40
+/* 0x40 was XLOG_HEAP2_VISIBLE */
 #define XLOG_HEAP2_MULTI_INSERT 0x50
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
 #define XLOG_HEAP2_NEW_CID		0x70
@@ -443,20 +443,6 @@ typedef struct xl_heap_inplace
 
 #define MinSizeOfHeapInplace	(offsetof(xl_heap_inplace, nmsgs) + sizeof(int))
 
-/*
- * This is what we need to know about setting a visibility map bit
- *
- * Backup blk 0: visibility map buffer
- * Backup blk 1: heap buffer
- */
-typedef struct xl_heap_visible
-{
-	TransactionId snapshotConflictHorizon;
-	uint8		flags;
-} xl_heap_visible;
-
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
-
 typedef struct xl_heap_new_cid
 {
 	/*
@@ -500,11 +486,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
-								   Buffer vm_buffer,
-								   TransactionId snapshotConflictHorizon,
-								   uint8 vmflags);
-
 /* in heapdesc.c, so it can be shared between frontend/backend code */
 extern void heap_xlog_deserialize_prune_and_freeze(char *cursor, uint16 flags,
 												   int *nplans, xlhp_freeze_plan **plans,
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 52cde56be86..e4e0cfa989e 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -15,7 +15,6 @@
 #define VISIBILITYMAP_H
 
 #include "access/visibilitymapdefs.h"
-#include "access/xlogdefs.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/relfilelocator.h"
@@ -32,15 +31,9 @@ extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
 							  Buffer *vmbuf);
 extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
-extern void visibilitymap_set(Relation rel,
-							  BlockNumber heapBlk, Buffer heapBuf,
-							  XLogRecPtr recptr,
-							  Buffer vmBuf,
-							  TransactionId cutoff_xid,
-							  uint8 flags);
-extern void visibilitymap_set_vmbits(BlockNumber heapBlk,
-									 Buffer vmBuf, uint8 flags,
-									 const RelFileLocator rlocator);
+extern void visibilitymap_set(BlockNumber heapBlk,
+							  Buffer vmBuf, uint8 flags,
+							  const RelFileLocator rlocator);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
diff --git a/src/include/access/visibilitymapdefs.h b/src/include/access/visibilitymapdefs.h
index 89153b3cd9a..e5794c8559e 100644
--- a/src/include/access/visibilitymapdefs.h
+++ b/src/include/access/visibilitymapdefs.h
@@ -21,14 +21,5 @@
 #define VISIBILITYMAP_ALL_FROZEN	0x02
 #define VISIBILITYMAP_VALID_BITS	0x03	/* OR of all valid visibilitymap
 											 * flags bits */
-/*
- * To detect recovery conflicts during logical decoding on a standby, we need
- * to know if a table is a user catalog table. For that we add an additional
- * bit into xl_heap_visible.flags, in addition to the above.
- *
- * NB: VISIBILITYMAP_XLOG_* may not be passed to visibilitymap_set().
- */
-#define VISIBILITYMAP_XLOG_CATALOG_REL	0x04
-#define VISIBILITYMAP_XLOG_VALID_BITS	(VISIBILITYMAP_VALID_BITS | VISIBILITYMAP_XLOG_CATALOG_REL)
 
 #endif							/* VISIBILITYMAPDEFS_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 480614d483b..f12f2deec43 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4416,7 +4416,6 @@ xl_heap_prune
 xl_heap_rewrite_mapping
 xl_heap_truncate
 xl_heap_update
-xl_heap_visible
 xl_invalid_page
 xl_invalid_page_key
 xl_invalidations
-- 
2.43.0



  [text/x-patch] v41-0008-Track-which-relations-are-modified-by-a-query.patch (8.7K, 9-v41-0008-Track-which-relations-are-modified-by-a-query.patch)
  download | inline diff:
From f5de33c173ca5216ea042475ec8ee2d8f0ddd3a9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 3 Dec 2025 15:07:24 -0500
Subject: [PATCH v41 08/12] Track which relations are modified by a query

Save the relids of modified relations in a bitmap in the PlannedStmt.
A later commit will pass this information down to scan nodes to control
whether or not on-access pruning is allowed to set the visibility map.
Setting the visibility map during a scan is counterproductive if the
query is going to modify the page immediately after.

Relations are considered modified if they are the target of INSERT,
UPDATE, DELETE, or MERGE, or if they have any row mark (including SELECT
FOR UPDATE/SHARE). All row mark types are included, even those which
don't actually modify tuples, because this bitmap is only used as a hint
to avoid unnecessary work.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com
---
 src/backend/executor/execMain.c        | 47 ++++++++++++++++++++++++++
 src/backend/executor/execParallel.c    |  1 +
 src/backend/executor/nodeLockRows.c    |  4 +++
 src/backend/executor/nodeModifyTable.c | 18 ++++++++++
 src/backend/optimizer/plan/planner.c   | 21 +++++++++++-
 src/include/nodes/plannodes.h          | 10 ++++++
 6 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 58b84955c2b..3f134f9a34d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -90,6 +90,9 @@ static bool ExecCheckPermissionsModified(Oid relOid, Oid userid,
 										 Bitmapset *modifiedCols,
 										 AclMode requiredPerms);
 static void ExecCheckXactReadOnly(PlannedStmt *plannedstmt);
+#ifdef USE_ASSERT_CHECKING
+static void ExecCheckModifiedRelIds(EState *estate);
+#endif
 static void EvalPlanQualStart(EPQState *epqstate, Plan *planTree);
 static void ReportNotNullViolationError(ResultRelInfo *resultRelInfo,
 										TupleTableSlot *slot,
@@ -827,6 +830,46 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
 }
 
 
+/*
+ * ExecCheckModifiedRelIds
+ *		Verify that every relation the executor actually opened for modification
+ *		or row locking is present in the planner's modifiedRelids.
+ *
+ * The planner's set may be a superset of what the executor touches, because it
+ * includes partitions that were pruned at runtime and parent row marks that the
+ * executor skips.
+ */
+#ifdef USE_ASSERT_CHECKING
+static void
+ExecCheckModifiedRelIds(EState *estate)
+{
+	PlannedStmt *plannedstmt = estate->es_plannedstmt;
+	Bitmapset  *executor_relids = NULL;
+	ListCell   *lc;
+
+	foreach(lc, estate->es_opened_result_relations)
+	{
+		ResultRelInfo *rri = (ResultRelInfo *) lfirst(lc);
+
+		if (rri->ri_RangeTableIndex != 0)
+			executor_relids = bms_add_member(executor_relids,
+											 rri->ri_RangeTableIndex);
+	}
+	if (estate->es_rowmarks)
+	{
+		for (int i = 0; i < estate->es_range_table_size; i++)
+		{
+			if (estate->es_rowmarks[i] != NULL)
+				executor_relids = bms_add_member(executor_relids,
+												 estate->es_rowmarks[i]->rti);
+		}
+	}
+	Assert(bms_is_subset(executor_relids, plannedstmt->modifiedRelids));
+	bms_free(executor_relids);
+}
+#endif
+
+
 /* ----------------------------------------------------------------
  *		InitPlan
  *
@@ -992,6 +1035,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	 */
 	planstate = ExecInitNode(plan, estate, eflags);
 
+#ifdef USE_ASSERT_CHECKING
+	ExecCheckModifiedRelIds(estate);
+#endif
+
 	/*
 	 * Get the tuple descriptor describing the type of tuples to return.
 	 */
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ac84af294c9..4f39767d033 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -188,6 +188,7 @@ ExecSerializePlan(Plan *plan, EState *estate)
 	pstmt->partPruneInfos = estate->es_part_prune_infos;
 	pstmt->rtable = estate->es_range_table;
 	pstmt->unprunableRelids = estate->es_unpruned_relids;
+	pstmt->modifiedRelids = estate->es_plannedstmt->modifiedRelids;
 	pstmt->permInfos = estate->es_rteperminfos;
 	pstmt->resultRelations = NIL;
 	pstmt->appendRelations = NIL;
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 8d865470780..d67f24fca8c 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -113,6 +113,10 @@ lnext:
 		}
 		erm->ermActive = true;
 
+		/* verify this relation is in the planner's modifiedRelids */
+		Assert(bms_is_member(erm->rti,
+							 estate->es_plannedstmt->modifiedRelids));
+
 		/* fetch the tuple's ctid */
 		datum = ExecGetJunkAttribute(slot,
 									 aerm->ctidAttNo,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4cd5e262e0f..6b4ee4f9378 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -896,6 +896,16 @@ ExecInsert(ModifyTableContext *context,
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
 
+	/*
+	 * Verify this relation is in the planner's set of modified relations.
+	 * Partitions opened by tuple routing have ri_RangeTableIndex == 0 because
+	 * they have no range table entry, so we can only check relations that are
+	 * in the range table.
+	 */
+	Assert(resultRelInfo->ri_RangeTableIndex == 0 ||
+		   bms_is_member(resultRelInfo->ri_RangeTableIndex,
+						 estate->es_plannedstmt->modifiedRelids));
+
 	/*
 	 * Open the table's indexes, if we have not done so already, so that we
 	 * can add new index entries for the inserted tuple.
@@ -1523,6 +1533,10 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 {
 	EState	   *estate = context->estate;
 
+	Assert(resultRelInfo->ri_RangeTableIndex == 0 ||
+		   bms_is_member(resultRelInfo->ri_RangeTableIndex,
+						 estate->es_plannedstmt->modifiedRelids));
+
 	return table_tuple_delete(resultRelInfo->ri_RelationDesc, tupleid,
 							  estate->es_output_cid,
 							  estate->es_snapshot,
@@ -2205,6 +2219,10 @@ ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	bool		partition_constraint_failed;
 	TM_Result	result;
 
+	Assert(resultRelInfo->ri_RangeTableIndex == 0 ||
+		   bms_is_member(resultRelInfo->ri_RangeTableIndex,
+						 estate->es_plannedstmt->modifiedRelids));
+
 	updateCxt->crossPartUpdate = false;
 
 	/*
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 42604a0f75c..847af979e31 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -340,8 +340,10 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	RelOptInfo *final_rel;
 	Path	   *best_path;
 	Plan	   *top_plan;
+	Bitmapset  *modifiedRelids = NULL;
 	ListCell   *lp,
-			   *lr;
+			   *lr,
+			   *lc;
 
 	/*
 	 * Set up global state for this planner invocation.  This data is needed
@@ -661,6 +663,23 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	result->subplans = glob->subplans;
 	result->rewindPlanIDs = glob->rewindPlanIDs;
 	result->rowMarks = glob->finalrowmarks;
+
+	/*
+	 * Compute modifiedRelids from result relations and row marks.  This is a
+	 * superset of what the executor will actually modify/lock at runtime,
+	 * because runtime partition pruning may eliminate some result relations,
+	 * and parent row marks are included here but skipped by the executor.
+	 */
+	foreach(lc, glob->resultRelations)
+		modifiedRelids = bms_add_member(modifiedRelids, lfirst_int(lc));
+	foreach(lc, glob->finalrowmarks)
+	{
+		PlanRowMark *rc = (PlanRowMark *) lfirst(lc);
+
+		modifiedRelids = bms_add_member(modifiedRelids, rc->rti);
+	}
+	result->modifiedRelids = modifiedRelids;
+
 	result->relationOids = glob->relationOids;
 	result->invalItems = glob->invalItems;
 	result->paramExecTypes = glob->paramExecTypes;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b6185825fcb..841c7707c59 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -112,6 +112,16 @@ typedef struct PlannedStmt
 	 */
 	Bitmapset  *unprunableRelids;
 
+	/*
+	 * RT indexes of relations modified by the query through
+	 * UPDATE/DELETE/INSERT/MERGE or targeted by SELECT FOR UPDATE/SHARE.
+	 *
+	 * Computed by the planner, this is a superset of what the executor will
+	 * actually touch at runtime, because it includes partitions that may be
+	 * pruned and parent row marks that the executor skips.
+	 */
+	Bitmapset  *modifiedRelids;
+
 	/*
 	 * list of RTEPermissionInfo nodes for rtable entries needing one
 	 */
-- 
2.43.0



  [text/x-patch] v41-0009-Thread-flags-through-begin-scan-APIs.patch (32.9K, 10-v41-0009-Thread-flags-through-begin-scan-APIs.patch)
  download | inline diff:
From a2aca8dcd8e944058ded737184343a6960f7cee6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Thu, 19 Mar 2026 17:05:55 -0400
Subject: [PATCH v41 09/12] Thread flags through begin-scan APIs

Add a user-settable flags parameter to the table_beginscan_* wrappers,
index_beginscan(), table_index_fetch_begin(), and the table
AM callback index_fetch_begin(). This allows users to pass additional
context to be used when building the scan descriptors.

For index scans, a new uint32 flags field is added to
IndexFetchTableData, and the heap AM stores the caller-provided flags
there in heapam_index_fetch_begin().

This introduces an extension point for follow-up work to pass
per-scan information (such as whether the relation is read-only for the
current query) from the executor to the AM layer.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com
---
 contrib/pgrowlocks/pgrowlocks.c           |  2 +-
 src/backend/access/brin/brin.c            |  3 +-
 src/backend/access/gin/gininsert.c        |  3 +-
 src/backend/access/heap/heapam_handler.c  |  9 ++-
 src/backend/access/index/genam.c          |  2 +
 src/backend/access/index/indexam.c        |  7 ++-
 src/backend/access/nbtree/nbtsort.c       |  2 +-
 src/backend/access/table/tableam.c        | 21 +++----
 src/backend/commands/constraint.c         |  2 +-
 src/backend/commands/copyto.c             |  2 +-
 src/backend/commands/tablecmds.c          |  8 +--
 src/backend/commands/typecmds.c           |  4 +-
 src/backend/executor/execIndexing.c       |  3 +-
 src/backend/executor/execReplication.c    | 12 ++--
 src/backend/executor/nodeBitmapHeapscan.c |  2 +-
 src/backend/executor/nodeIndexonlyscan.c  |  5 +-
 src/backend/executor/nodeIndexscan.c      |  6 +-
 src/backend/executor/nodeSamplescan.c     |  2 +-
 src/backend/executor/nodeSeqscan.c        |  6 +-
 src/backend/executor/nodeTidrangescan.c   |  6 +-
 src/backend/partitioning/partbounds.c     |  2 +-
 src/backend/utils/adt/selfuncs.c          |  1 +
 src/include/access/genam.h                |  4 +-
 src/include/access/heapam.h               |  5 +-
 src/include/access/relscan.h              |  1 +
 src/include/access/tableam.h              | 72 +++++++++++++++--------
 26 files changed, 117 insertions(+), 75 deletions(-)

diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c
index ff3692c87c4..0556e9f7b88 100644
--- a/contrib/pgrowlocks/pgrowlocks.c
+++ b/contrib/pgrowlocks/pgrowlocks.c
@@ -115,7 +115,7 @@ pgrowlocks(PG_FUNCTION_ARGS)
 					   RelationGetRelationName(rel));
 
 	/* Scan the relation */
-	scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
+	scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL, 0);
 	hscan = (HeapScanDesc) scan;
 
 	attinmeta = TupleDescGetAttInMetadata(rsinfo->setDesc);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2a0f8c8e3b8..b25e814a996 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2844,7 +2844,8 @@ _brin_parallel_scan_and_build(BrinBuildState *state,
 	indexInfo->ii_Concurrent = brinshared->isconcurrent;
 
 	scan = table_beginscan_parallel(heap,
-									ParallelTableScanFromBrinShared(brinshared));
+									ParallelTableScanFromBrinShared(brinshared),
+									0);
 
 	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
 									   brinbuildCallbackParallel, state, scan);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e54782d9dd8..555b16771e9 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2068,7 +2068,8 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
 
 	scan = table_beginscan_parallel(heap,
-									ParallelTableScanFromGinBuildShared(ginshared));
+									ParallelTableScanFromGinBuildShared(ginshared),
+									0);
 
 	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
 									   ginBuildCallbackParallel, state, scan);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 253a735b6c1..66726b22de6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -80,11 +80,12 @@ heapam_slot_callbacks(Relation relation)
  */
 
 static IndexFetchTableData *
-heapam_index_fetch_begin(Relation rel)
+heapam_index_fetch_begin(Relation rel, uint32 flags)
 {
 	IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);
 
 	hscan->xs_base.rel = rel;
+	hscan->xs_base.flags = flags;
 	hscan->xs_cbuf = InvalidBuffer;
 	hscan->xs_vmbuffer = InvalidBuffer;
 
@@ -762,7 +763,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex,
+									0,	/* flags */
+									SnapshotAny, NULL, 0, 0);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
@@ -771,7 +774,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
 									 PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
 
-		tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+		tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL, 0);
 		heapScan = (HeapScanDesc) tableScan;
 		indexScan = NULL;
 
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 5e89b86a62c..b099d956e41 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -455,6 +455,7 @@ systable_beginscan(Relation heapRelation,
 		}
 
 		sysscan->iscan = index_beginscan(heapRelation, irel,
+										 0, /* flags */
 										 snapshot, NULL, nkeys, 0);
 		index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 		sysscan->scan = NULL;
@@ -716,6 +717,7 @@ systable_beginscan_ordered(Relation heapRelation,
 		bsysscan = true;
 
 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
+									 0, /* flags */
 									 snapshot, NULL, nkeys, 0);
 	index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 	sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 5eb7e99ad3e..63d5daadca6 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -255,6 +255,7 @@ index_insert_cleanup(Relation indexRelation,
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
+				uint32 flags,
 				Snapshot snapshot,
 				IndexScanInstrumentation *instrument,
 				int nkeys, int norderbys)
@@ -284,7 +285,7 @@ index_beginscan(Relation heapRelation,
 	scan->instrument = instrument;
 
 	/* prepare to fetch index matches from table */
-	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+	scan->xs_heapfetch = table_index_fetch_begin(heapRelation, flags);
 
 	return scan;
 }
@@ -593,7 +594,7 @@ IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel,
 						 IndexScanInstrumentation *instrument,
 						 int nkeys, int norderbys,
-						 ParallelIndexScanDesc pscan)
+						 ParallelIndexScanDesc pscan, uint32 flags)
 {
 	Snapshot	snapshot;
 	IndexScanDesc scan;
@@ -615,7 +616,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 	scan->instrument = instrument;
 
 	/* prepare to fetch index matches from table */
-	scan->xs_heapfetch = table_index_fetch_begin(heaprel);
+	scan->xs_heapfetch = table_index_fetch_begin(heaprel, flags);
 
 	return scan;
 }
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 47a9bda30c9..016a5e546dd 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1928,7 +1928,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
 	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+									ParallelTableScanFromBTShared(btshared), 0);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index dfda1af412e..7a12e808b07 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -118,7 +118,7 @@ table_beginscan_catalog(Relation relation, int nkeys, ScanKeyData *key)
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
 	return table_beginscan_common(relation, snapshot, nkeys, key,
-								  NULL, flags);
+								  NULL, flags, 0);
 }
 
 
@@ -163,10 +163,10 @@ table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
 }
 
 TableScanDesc
-table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
+table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan, uint32 flags)
 {
 	Snapshot	snapshot;
-	uint32		flags = SO_TYPE_SEQSCAN |
+	uint32		internal_flags = SO_TYPE_SEQSCAN |
 		SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
@@ -176,7 +176,7 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
 		RegisterSnapshot(snapshot);
-		flags |= SO_TEMP_SNAPSHOT;
+		internal_flags |= SO_TEMP_SNAPSHOT;
 	}
 	else
 	{
@@ -185,16 +185,17 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 	}
 
 	return table_beginscan_common(relation, snapshot, 0, NULL,
-								  pscan, flags);
+								  pscan, internal_flags, flags);
 }
 
 TableScanDesc
 table_beginscan_parallel_tidrange(Relation relation,
-								  ParallelTableScanDesc pscan)
+								  ParallelTableScanDesc pscan,
+								  uint32 flags)
 {
 	Snapshot	snapshot;
-	uint32		flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
 	TableScanDesc sscan;
+	uint32		internal_flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
@@ -206,7 +207,7 @@ table_beginscan_parallel_tidrange(Relation relation,
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
 		RegisterSnapshot(snapshot);
-		flags |= SO_TEMP_SNAPSHOT;
+		internal_flags |= SO_TEMP_SNAPSHOT;
 	}
 	else
 	{
@@ -215,7 +216,7 @@ table_beginscan_parallel_tidrange(Relation relation,
 	}
 
 	sscan = table_beginscan_common(relation, snapshot, 0, NULL,
-								   pscan, flags);
+								   pscan, internal_flags, flags);
 	return sscan;
 }
 
@@ -248,7 +249,7 @@ table_index_fetch_tuple_check(Relation rel,
 	bool		found;
 
 	slot = table_slot_create(rel, NULL);
-	scan = table_index_fetch_begin(rel);
+	scan = table_index_fetch_begin(rel, 0);
 	found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again,
 									all_dead);
 	table_index_fetch_end(scan);
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index cc11c47b6f2..37cfbd63938 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -106,7 +106,7 @@ unique_key_recheck(PG_FUNCTION_ARGS)
 	 */
 	tmptid = checktid;
 	{
-		IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation);
+		IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation, 0);
 		bool		call_again = false;
 
 		if (!table_index_fetch_tuple(scan, &tmptid, SnapshotSelf, slot,
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index 499ce9ad3db..fb791c7990b 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -1160,7 +1160,7 @@ CopyRelationTo(CopyToState cstate, Relation rel, Relation root_rel, uint64 *proc
 	AttrMap    *map = NULL;
 	TupleTableSlot *root_slot = NULL;
 
-	scandesc = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
+	scandesc = table_beginscan(rel, GetActiveSnapshot(), 0, NULL, 0);
 	slot = table_slot_create(rel, NULL);
 
 	/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 67e42e5df29..cc2ec9393a8 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6411,7 +6411,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap)
 		 * checking all the constraints.
 		 */
 		snapshot = RegisterSnapshot(GetLatestSnapshot());
-		scan = table_beginscan(oldrel, snapshot, 0, NULL);
+		scan = table_beginscan(oldrel, snapshot, 0, NULL, 0);
 
 		/*
 		 * Switch to per-tuple memory context and reset it for each tuple
@@ -13980,7 +13980,7 @@ validateForeignKeyConstraint(char *conname,
 	 */
 	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	slot = table_slot_create(rel, NULL);
-	scan = table_beginscan(rel, snapshot, 0, NULL);
+	scan = table_beginscan(rel, snapshot, 0, NULL, 0);
 
 	perTupCxt = AllocSetContextCreate(CurrentMemoryContext,
 									  "validateForeignKeyConstraint",
@@ -22881,7 +22881,7 @@ MergePartitionsMoveRows(List **wqueue, List *mergingPartitions, Relation newPart
 
 		/* Scan through the rows. */
 		snapshot = RegisterSnapshot(GetLatestSnapshot());
-		scan = table_beginscan(mergingPartition, snapshot, 0, NULL);
+		scan = table_beginscan(mergingPartition, snapshot, 0, NULL, 0);
 
 		/*
 		 * Switch to per-tuple memory context and reset it for each tuple
@@ -23345,7 +23345,7 @@ SplitPartitionMoveRows(List **wqueue, Relation rel, Relation splitRel,
 
 	/* Scan through the rows. */
 	snapshot = RegisterSnapshot(GetLatestSnapshot());
-	scan = table_beginscan(splitRel, snapshot, 0, NULL);
+	scan = table_beginscan(splitRel, snapshot, 0, NULL, 0);
 
 	/*
 	 * Switch to per-tuple memory context and reset it for each tuple
diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index 3dab6bb5a79..5316cea7cec 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -3185,7 +3185,7 @@ validateDomainNotNullConstraint(Oid domainoid)
 
 		/* Scan all tuples in this relation */
 		snapshot = RegisterSnapshot(GetLatestSnapshot());
-		scan = table_beginscan(testrel, snapshot, 0, NULL);
+		scan = table_beginscan(testrel, snapshot, 0, NULL, 0);
 		slot = table_slot_create(testrel, NULL);
 		while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 		{
@@ -3266,7 +3266,7 @@ validateDomainCheckConstraint(Oid domainoid, const char *ccbin, LOCKMODE lockmod
 
 		/* Scan all tuples in this relation */
 		snapshot = RegisterSnapshot(GetLatestSnapshot());
-		scan = table_beginscan(testrel, snapshot, 0, NULL);
+		scan = table_beginscan(testrel, snapshot, 0, NULL, 0);
 		slot = table_slot_create(testrel, NULL);
 		while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 		{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9d071e495c6..c46beedeb71 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -815,7 +815,8 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index, 0,	/* flags */
+								 &DirtySnapshot, NULL, indnkeyatts, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
 	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 2497ee7edc5..23509771557 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -205,7 +205,9 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
 	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel,
+						   0,	/* flags */
+						   &snap, NULL, skey_attoff, 0);
 
 retry:
 	found = false;
@@ -383,7 +385,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
 
 	/* Start a heap scan. */
 	InitDirtySnapshot(snap);
-	scan = table_beginscan(rel, &snap, 0, NULL);
+	scan = table_beginscan(rel, &snap, 0, NULL, 0);
 	scanslot = table_slot_create(rel, NULL);
 
 retry:
@@ -602,7 +604,7 @@ RelationFindDeletedTupleInfoSeq(Relation rel, TupleTableSlot *searchslot,
 	 * not yet committed or those just committed prior to the scan are
 	 * excluded in update_most_recent_deletion_info().
 	 */
-	scan = table_beginscan(rel, SnapshotAny, 0, NULL);
+	scan = table_beginscan(rel, SnapshotAny, 0, NULL, 0);
 	scanslot = table_slot_create(rel, NULL);
 
 	table_rescan(scan, NULL);
@@ -666,7 +668,9 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid,
 	 * not yet committed or those just committed prior to the scan are
 	 * excluded in update_most_recent_deletion_info().
 	 */
-	scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel,
+						   0,	/* flags */
+						   SnapshotAny, NULL, skey_attoff, 0);
 
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 7cf8d23c742..324e2bed22c 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -148,7 +148,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
 			table_beginscan_bm(node->ss.ss_currentRelation,
 							   node->ss.ps.state->es_snapshot,
 							   0,
-							   NULL);
+							   NULL, 0);
 	}
 
 	node->ss.ss_currentScanDesc->st.rs_tbmiterator = tbmiterator;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index c8db357e69f..decfd792809 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -92,6 +92,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->ioss_RelationDesc,
+								   0,	/* flags */
 								   estate->es_snapshot,
 								   &node->ioss_Instrument,
 								   node->ioss_NumScanKeys,
@@ -790,7 +791,7 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 								 &node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
-								 piscan);
+								 piscan, 0);
 	node->ioss_ScanDesc->xs_want_itup = true;
 	node->ioss_VMBuffer = InvalidBuffer;
 
@@ -856,7 +857,7 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 								 &node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
-								 piscan);
+								 piscan, 0);
 	node->ioss_ScanDesc->xs_want_itup = true;
 
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index bd83e4712b3..a37fa9abece 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -110,6 +110,7 @@ IndexNext(IndexScanState *node)
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->iss_RelationDesc,
+								   0,	/* flags */
 								   estate->es_snapshot,
 								   &node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -206,6 +207,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->iss_RelationDesc,
+								   0,	/* flags */
 								   estate->es_snapshot,
 								   &node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -1726,7 +1728,7 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 								 &node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
-								 piscan);
+								 piscan, 0);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1790,7 +1792,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 								 &node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
-								 piscan);
+								 piscan, 0);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index 6b0d65f752f..cc6b23abee0 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -298,7 +298,7 @@ tablesample_init(SampleScanState *scanstate)
 									 0, NULL,
 									 scanstate->use_bulkread,
 									 allow_sync,
-									 scanstate->use_pagemode);
+									 scanstate->use_pagemode, 0);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 8f219f60a93..c2d9b7293de 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -71,7 +71,7 @@ SeqNext(SeqScanState *node)
 		 */
 		scandesc = table_beginscan(node->ss.ss_currentRelation,
 								   estate->es_snapshot,
-								   0, NULL);
+								   0, NULL, 0);
 		node->ss.ss_currentScanDesc = scandesc;
 	}
 
@@ -375,7 +375,7 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 								  estate->es_snapshot);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		table_beginscan_parallel(node->ss.ss_currentRelation, pscan, 0);
 }
 
 /* ----------------------------------------------------------------
@@ -408,5 +408,5 @@ ExecSeqScanInitializeWorker(SeqScanState *node,
 
 	pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		table_beginscan_parallel(node->ss.ss_currentRelation, pscan, 0);
 }
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 617713bde04..994f70989bc 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -245,7 +245,7 @@ TidRangeNext(TidRangeScanState *node)
 			scandesc = table_beginscan_tidrange(node->ss.ss_currentRelation,
 												estate->es_snapshot,
 												&node->trss_mintid,
-												&node->trss_maxtid);
+												&node->trss_maxtid, 0);
 			node->ss.ss_currentScanDesc = scandesc;
 		}
 		else
@@ -460,7 +460,7 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
-										  pscan);
+										  pscan, 0);
 }
 
 /* ----------------------------------------------------------------
@@ -494,5 +494,5 @@ ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
 	pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
-										  pscan);
+										  pscan, 0);
 }
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 0ca312ac27d..b7c4e6d1071 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -3362,7 +3362,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
 		econtext = GetPerTupleExprContext(estate);
 		snapshot = RegisterSnapshot(GetLatestSnapshot());
 		tupslot = table_slot_create(part_rel, &estate->es_tupleTable);
-		scan = table_beginscan(part_rel, snapshot, 0, NULL);
+		scan = table_beginscan(part_rel, snapshot, 0, NULL, 0);
 
 		/*
 		 * Switch to per-tuple memory context and reset it for each tuple
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 86b55c9bb8b..1d64d286881 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7177,6 +7177,7 @@ get_actual_variable_endpoint(Relation heapRel,
 							  GlobalVisTestFor(heapRel));
 
 	index_scan = index_beginscan(heapRel, indexRel,
+								 0, /* flags */
 								 &SnapshotNonVacuumable, NULL,
 								 1, 0);
 	/* Set it up for index-only scan */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1a27bf060b3..b98c20a0edc 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -156,6 +156,7 @@ extern void index_insert_cleanup(Relation indexRelation,
 
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
+									 uint32 flags,
 									 Snapshot snapshot,
 									 IndexScanInstrumentation *instrument,
 									 int nkeys, int norderbys);
@@ -184,7 +185,8 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel,
 											  IndexScanInstrumentation *instrument,
 											  int nkeys, int norderbys,
-											  ParallelIndexScanDesc pscan);
+											  ParallelIndexScanDesc pscan,
+											  uint32 flags);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
 extern bool index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b9577c24844..e32f28d7acb 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -95,10 +95,7 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
-	/*
-	 * For sequential scans and bitmap heap scans. The current heap block's
-	 * corresponding page in the visibility map.
-	 */
+	/* Current heap block's corresponding page in the visibility map */
 	Buffer		rs_vmbuffer;
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce340c076f8..80ea0b437d1 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -122,6 +122,7 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 typedef struct IndexFetchTableData
 {
 	Relation	rel;
+	uint32		flags;
 } IndexFetchTableData;
 
 struct IndexScanInstrumentation;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06084752245..8357d05d83b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -65,6 +65,16 @@ typedef enum ScanOptions
 	SO_TEMP_SNAPSHOT = 1 << 9,
 }			ScanOptions;
 
+/*
+ * Mask of flags that are set internally by the table_beginscan_* functions
+ * and must not be passed by callers.
+ */
+#define SO_INTERNAL_FLAGS \
+	(SO_TYPE_SEQSCAN | SO_TYPE_BITMAPSCAN | SO_TYPE_SAMPLESCAN | \
+	 SO_TYPE_TIDSCAN | SO_TYPE_TIDRANGESCAN | SO_TYPE_ANALYZE | \
+	 SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE | \
+	 SO_TEMP_SNAPSHOT)
+
 /*
  * Result codes for table_{update,delete,lock_tuple}, and for visibility
  * routines inside table AMs.
@@ -420,7 +430,7 @@ typedef struct TableAmRoutine
 	 *
 	 * Tuples for an index scan can then be fetched via index_fetch_tuple.
 	 */
-	struct IndexFetchTableData *(*index_fetch_begin) (Relation rel);
+	struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, uint32 flags);
 
 	/*
 	 * Reset index fetch. Typically this will release cross index fetch
@@ -871,12 +881,18 @@ extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
  * A wrapper around the Table Access Method scan_begin callback, to centralize
  * error checking. All calls to ->scan_begin() should go through this
  * function.
+ *
+ * The caller-provided user_flags are validated against SO_INTERNAL_FLAGS to
+ * catch callers that accidentally pass scan-type or other internal flags.
  */
 static TableScanDesc
 table_beginscan_common(Relation rel, Snapshot snapshot, int nkeys,
 					   ScanKeyData *key, ParallelTableScanDesc pscan,
-					   uint32 flags)
+					   uint32 flags, uint32 user_flags)
 {
+	Assert((user_flags & SO_INTERNAL_FLAGS) == 0);
+	flags |= user_flags;
+
 	/*
 	 * We don't allow scans to be started while CheckXidAlive is set, except
 	 * via systable_beginscan() et al.  See detailed comments in xact.c where
@@ -894,12 +910,13 @@ table_beginscan_common(Relation rel, Snapshot snapshot, int nkeys,
  */
 static inline TableScanDesc
 table_beginscan(Relation rel, Snapshot snapshot,
-				int nkeys, ScanKeyData *key)
+				int nkeys, ScanKeyData *key, uint32 flags)
 {
-	uint32		flags = SO_TYPE_SEQSCAN |
+	uint32		internal_flags = SO_TYPE_SEQSCAN |
 		SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
 
-	return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+	return table_beginscan_common(rel, snapshot, nkeys, key, NULL,
+								  internal_flags, flags);
 }
 
 /*
@@ -928,7 +945,7 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
 
-	return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+	return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags, 0);
 }
 
 /*
@@ -939,11 +956,12 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
  */
 static inline TableScanDesc
 table_beginscan_bm(Relation rel, Snapshot snapshot,
-				   int nkeys, ScanKeyData *key)
+				   int nkeys, ScanKeyData *key, uint32 flags)
 {
-	uint32		flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
+	uint32		internal_flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
 
-	return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+	return table_beginscan_common(rel, snapshot, nkeys, key, NULL,
+								  internal_flags, flags);
 }
 
 /*
@@ -957,18 +975,19 @@ static inline TableScanDesc
 table_beginscan_sampling(Relation rel, Snapshot snapshot,
 						 int nkeys, ScanKeyData *key,
 						 bool allow_strat, bool allow_sync,
-						 bool allow_pagemode)
+						 bool allow_pagemode, uint32 flags)
 {
-	uint32		flags = SO_TYPE_SAMPLESCAN;
+	uint32		internal_flags = SO_TYPE_SAMPLESCAN;
 
 	if (allow_strat)
-		flags |= SO_ALLOW_STRAT;
+		internal_flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
-		flags |= SO_ALLOW_SYNC;
+		internal_flags |= SO_ALLOW_SYNC;
 	if (allow_pagemode)
-		flags |= SO_ALLOW_PAGEMODE;
+		internal_flags |= SO_ALLOW_PAGEMODE;
 
-	return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+	return table_beginscan_common(rel, snapshot, nkeys, key, NULL,
+								  internal_flags, flags);
 }
 
 /*
@@ -981,7 +1000,7 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
 {
 	uint32		flags = SO_TYPE_TIDSCAN;
 
-	return table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags);
+	return table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags, 0);
 }
 
 /*
@@ -994,7 +1013,7 @@ table_beginscan_analyze(Relation rel)
 {
 	uint32		flags = SO_TYPE_ANALYZE;
 
-	return table_beginscan_common(rel, NULL, 0, NULL, NULL, flags);
+	return table_beginscan_common(rel, NULL, 0, NULL, NULL, flags, 0);
 }
 
 /*
@@ -1059,12 +1078,13 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 static inline TableScanDesc
 table_beginscan_tidrange(Relation rel, Snapshot snapshot,
 						 ItemPointer mintid,
-						 ItemPointer maxtid)
+						 ItemPointer maxtid, uint32 flags)
 {
 	TableScanDesc sscan;
-	uint32		flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+	uint32		internal_flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
 
-	sscan = table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags);
+	sscan = table_beginscan_common(rel, snapshot, 0, NULL, NULL,
+								   internal_flags, flags);
 
 	/* Set the range of TIDs to scan */
 	sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
@@ -1139,7 +1159,8 @@ extern void table_parallelscan_initialize(Relation rel,
  * Caller must hold a suitable lock on the relation.
  */
 extern TableScanDesc table_beginscan_parallel(Relation relation,
-											  ParallelTableScanDesc pscan);
+											  ParallelTableScanDesc pscan,
+											  uint32 flags);
 
 /*
  * Begin a parallel tid range scan. `pscan` needs to have been initialized
@@ -1149,7 +1170,8 @@ extern TableScanDesc table_beginscan_parallel(Relation relation,
  * Caller must hold a suitable lock on the relation.
  */
 extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
-													   ParallelTableScanDesc pscan);
+													   ParallelTableScanDesc pscan,
+													   uint32 flags);
 
 /*
  * Restart a parallel scan.  Call this in the leader process.  Caller is
@@ -1175,8 +1197,10 @@ table_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
  * Tuples for an index scan can then be fetched via table_index_fetch_tuple().
  */
 static inline IndexFetchTableData *
-table_index_fetch_begin(Relation rel)
+table_index_fetch_begin(Relation rel, uint32 flags)
 {
+	Assert((flags & SO_INTERNAL_FLAGS) == 0);
+
 	/*
 	 * We don't allow scans to be started while CheckXidAlive is set, except
 	 * via systable_beginscan() et al.  See detailed comments in xact.c where
@@ -1185,7 +1209,7 @@ table_index_fetch_begin(Relation rel)
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "scan started during logical decoding");
 
-	return rel->rd_tableam->index_fetch_begin(rel);
+	return rel->rd_tableam->index_fetch_begin(rel, flags);
 }
 
 /*
-- 
2.43.0



  [text/x-patch] v41-0010-Pass-down-information-on-table-modification-to-s.patch (11.3K, 11-v41-0010-Pass-down-information-on-table-modification-to-s.patch)
  download | inline diff:
From 512c2a19651aecab3a15712992586dce288b83fd Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Mon, 2 Mar 2026 16:31:33 -0500
Subject: [PATCH v41 10/12] Pass down information on table modification to scan
 node

Pass down information to sequential scan, index [only] scan, bitmap
table scan, sample scan, and TID range scan nodes on whether or not the
query modifies the relation being scanned. A later commit will use this
information to update the VM during on-access pruning only if the
relation is not modified by the query.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Andrey Borodin <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/4379FDA3-9446-4E2C-9C15-32EFE8D4F31B%40yandex-team.ru
---
 src/backend/executor/nodeBitmapHeapscan.c |  6 +++++-
 src/backend/executor/nodeIndexonlyscan.c  | 15 ++++++++++++---
 src/backend/executor/nodeIndexscan.c      | 18 ++++++++++++++----
 src/backend/executor/nodeSamplescan.c     |  5 ++++-
 src/backend/executor/nodeSeqscan.c        | 18 +++++++++++++++---
 src/backend/executor/nodeTidrangescan.c   | 15 ++++++++++++---
 src/include/access/tableam.h              |  3 +++
 src/include/executor/executor.h           | 10 ++++++++++
 8 files changed, 75 insertions(+), 15 deletions(-)

diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 324e2bed22c..aec92c868ac 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -144,11 +144,15 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
 	 */
 	if (!node->ss.ss_currentScanDesc)
 	{
+		uint32		flags = ScanRelIsReadOnly(&node->ss) ?
+			SO_HINT_REL_READ_ONLY : 0;
+
 		node->ss.ss_currentScanDesc =
 			table_beginscan_bm(node->ss.ss_currentRelation,
 							   node->ss.ps.state->es_snapshot,
 							   0,
-							   NULL, 0);
+							   NULL,
+							   flags);
 	}
 
 	node->ss.ss_currentScanDesc->st.rs_tbmiterator = tbmiterator;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index decfd792809..b977719c295 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,6 +85,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
 
 	if (scandesc == NULL)
 	{
+		uint32		flags = ScanRelIsReadOnly(&node->ss) ?
+			SO_HINT_REL_READ_ONLY : 0;
+
 		/*
 		 * We reach here if the index only scan is not parallel, or if we're
 		 * serially executing an index only scan that was planned to be
@@ -92,7 +95,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->ioss_RelationDesc,
-								   0,	/* flags */
+								   flags,
 								   estate->es_snapshot,
 								   &node->ioss_Instrument,
 								   node->ioss_NumScanKeys,
@@ -791,7 +794,10 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 								 &node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
-								 piscan, 0);
+								 piscan,
+								 ScanRelIsReadOnly(&node->ss) ?
+								 SO_HINT_REL_READ_ONLY : 0);
+
 	node->ioss_ScanDesc->xs_want_itup = true;
 	node->ioss_VMBuffer = InvalidBuffer;
 
@@ -857,7 +863,10 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 								 &node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
-								 piscan, 0);
+								 piscan,
+								 ScanRelIsReadOnly(&node->ss) ?
+								 SO_HINT_REL_READ_ONLY : 0);
+
 	node->ioss_ScanDesc->xs_want_itup = true;
 
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index a37fa9abece..ad460c11679 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -104,13 +104,16 @@ IndexNext(IndexScanState *node)
 
 	if (scandesc == NULL)
 	{
+		uint32		flags = ScanRelIsReadOnly(&node->ss) ?
+			SO_HINT_REL_READ_ONLY : 0;
+
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->iss_RelationDesc,
-								   0,	/* flags */
+								   flags,
 								   estate->es_snapshot,
 								   &node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -201,13 +204,16 @@ IndexNextWithReorder(IndexScanState *node)
 
 	if (scandesc == NULL)
 	{
+		uint32		flags = ScanRelIsReadOnly(&node->ss) ?
+			SO_HINT_REL_READ_ONLY : 0;
+
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->iss_RelationDesc,
-								   0,	/* flags */
+								   flags,
 								   estate->es_snapshot,
 								   &node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -1728,7 +1734,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 								 &node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
-								 piscan, 0);
+								 piscan,
+								 ScanRelIsReadOnly(&node->ss) ?
+								 SO_HINT_REL_READ_ONLY : 0);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1792,7 +1800,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 								 &node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
-								 piscan, 0);
+								 piscan,
+								 ScanRelIsReadOnly(&node->ss) ?
+								 SO_HINT_REL_READ_ONLY : 0);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index cc6b23abee0..71c70e5e5c7 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -292,13 +292,16 @@ tablesample_init(SampleScanState *scanstate)
 	/* Now we can create or reset the HeapScanDesc */
 	if (scanstate->ss.ss_currentScanDesc == NULL)
 	{
+		uint32		flags = ScanRelIsReadOnly(&scanstate->ss) ?
+			SO_HINT_REL_READ_ONLY : 0;
+
 		scanstate->ss.ss_currentScanDesc =
 			table_beginscan_sampling(scanstate->ss.ss_currentRelation,
 									 scanstate->ss.ps.state->es_snapshot,
 									 0, NULL,
 									 scanstate->use_bulkread,
 									 allow_sync,
-									 scanstate->use_pagemode, 0);
+									 scanstate->use_pagemode, flags);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index c2d9b7293de..79470e6b9b5 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -65,13 +65,17 @@ SeqNext(SeqScanState *node)
 
 	if (scandesc == NULL)
 	{
+		uint32		flags = ScanRelIsReadOnly(&node->ss) ?
+			SO_HINT_REL_READ_ONLY : 0;
+
 		/*
 		 * We reach here if the scan is not parallel, or if we're serially
 		 * executing a scan that was planned to be parallel.
 		 */
 		scandesc = table_beginscan(node->ss.ss_currentRelation,
 								   estate->es_snapshot,
-								   0, NULL, 0);
+								   0, NULL, flags);
+
 		node->ss.ss_currentScanDesc = scandesc;
 	}
 
@@ -368,14 +372,18 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 {
 	EState	   *estate = node->ss.ps.state;
 	ParallelTableScanDesc pscan;
+	uint32		flags = ScanRelIsReadOnly(&node->ss) ?
+		SO_HINT_REL_READ_ONLY : 0;
 
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
 								  estate->es_snapshot);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+
 	node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, pscan, 0);
+		table_beginscan_parallel(node->ss.ss_currentRelation, pscan,
+								 flags);
 }
 
 /* ----------------------------------------------------------------
@@ -405,8 +413,12 @@ ExecSeqScanInitializeWorker(SeqScanState *node,
 							ParallelWorkerContext *pwcxt)
 {
 	ParallelTableScanDesc pscan;
+	uint32		flags = ScanRelIsReadOnly(&node->ss) ?
+		SO_HINT_REL_READ_ONLY : 0;
 
 	pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, pscan, 0);
+		table_beginscan_parallel(node->ss.ss_currentRelation,
+								 pscan,
+								 flags);
 }
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 994f70989bc..4257afd96ed 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -242,10 +242,13 @@ TidRangeNext(TidRangeScanState *node)
 
 		if (scandesc == NULL)
 		{
+			uint32		flags = ScanRelIsReadOnly(&node->ss) ?
+				SO_HINT_REL_READ_ONLY : 0;
+
 			scandesc = table_beginscan_tidrange(node->ss.ss_currentRelation,
 												estate->es_snapshot,
 												&node->trss_mintid,
-												&node->trss_maxtid, 0);
+												&node->trss_maxtid, flags);
 			node->ss.ss_currentScanDesc = scandesc;
 		}
 		else
@@ -452,15 +455,18 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
 {
 	EState	   *estate = node->ss.ps.state;
 	ParallelTableScanDesc pscan;
+	uint32		flags = ScanRelIsReadOnly(&node->ss) ?
+		SO_HINT_REL_READ_ONLY : 0;
 
 	pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
 								  estate->es_snapshot);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
-										  pscan, 0);
+										  pscan, flags);
 }
 
 /* ----------------------------------------------------------------
@@ -490,9 +496,12 @@ ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
 								 ParallelWorkerContext *pwcxt)
 {
 	ParallelTableScanDesc pscan;
+	uint32		flags = ScanRelIsReadOnly(&node->ss) ?
+		SO_HINT_REL_READ_ONLY : 0;
 
 	pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
-										  pscan, 0);
+										  pscan, flags);
 }
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8357d05d83b..487e38292fa 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -63,6 +63,9 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+
+	/* set if the query doesn't modify the relation */
+	SO_HINT_REL_READ_ONLY = 1 << 10,
 }			ScanOptions;
 
 /*
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 07f4b1f7490..31c4192b67e 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -690,6 +690,16 @@ extern void ExecCreateScanSlotFromOuterPlan(EState *estate,
 
 extern bool ExecRelationIsTargetRelation(EState *estate, Index scanrelid);
 
+/*
+ * Return true if the scan node's relation is not modified by the query.
+ */
+static inline bool
+ScanRelIsReadOnly(ScanState *ss)
+{
+	return !bms_is_member(((Scan *) ss->ps.plan)->scanrelid,
+						  ss->ps.state->es_plannedstmt->modifiedRelids);
+}
+
 extern Relation ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags);
 
 extern void ExecInitRangeTable(EState *estate, List *rangeTable, List *permInfos,
-- 
2.43.0



  [text/x-patch] v41-0011-Allow-on-access-pruning-to-set-pages-all-visible.patch (10.1K, 12-v41-0011-Allow-on-access-pruning-to-set-pages-all-visible.patch)
  download | inline diff:
From f02a7e880af91c4fb14edc75076d448f00462270 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 27 Feb 2026 16:33:40 -0500
Subject: [PATCH v41 11/12] Allow on-access pruning to set pages all-visible

Many queries do not modify the underlying relation. For such queries, if
on-access pruning occurs during the scan, we can check whether the page
has become all-visible and update the visibility map accordingly.
Previously, only vacuum and COPY FREEZE marked pages as all-visible or
all-frozen.

This commit implements on-access VM setting for sequential scans as well
as for the underlying heap relation in index scans and bitmap heap
scans.

Setting the visibility map on-access can avoid write amplification
caused by vacuum later needing to set the page all-visible, trigger a
write and potentially FPI. It also allows more frequent index-only
scans, since they requrie pages to be marked all-visible in the VM.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
---
 src/backend/access/heap/heapam.c         |  3 +-
 src/backend/access/heap/heapam_handler.c |  6 ++--
 src/backend/access/heap/pruneheap.c      | 46 +++++++++++++++++-------
 src/backend/access/heap/vacuumlazy.c     |  2 +-
 src/include/access/heapam.h              | 16 +++++++--
 5 files changed, 54 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 044f385e477..dbdf6521c42 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -633,7 +633,8 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	/*
 	 * Prune and repair fragmentation for the whole page, if possible.
 	 */
-	heap_page_prune_opt(scan->rs_base.rs_rd, buffer, &scan->rs_vmbuffer);
+	heap_page_prune_opt(scan->rs_base.rs_rd, buffer, &scan->rs_vmbuffer,
+						(sscan->rs_flags & SO_HINT_REL_READ_ONLY));
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 66726b22de6..651efa0127a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -148,7 +148,8 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		 */
 		if (prev_buf != hscan->xs_cbuf)
 			heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf,
-								&hscan->xs_vmbuffer);
+								&hscan->xs_vmbuffer,
+								(hscan->xs_base.flags & SO_HINT_REL_READ_ONLY));
 	}
 
 	/* Obtain share-lock on the buffer so we can examine visibility */
@@ -2545,7 +2546,8 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
 	/*
 	 * Prune and repair fragmentation for the whole page, if possible.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer, &hscan->rs_vmbuffer);
+	heap_page_prune_opt(scan->rs_rd, buffer, &hscan->rs_vmbuffer,
+						scan->rs_flags & SO_HINT_REL_READ_ONLY);
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index fac7194dcba..deb7b948c1e 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -44,6 +44,8 @@ typedef struct
 	bool		mark_unused_now;
 	/* whether to attempt freezing tuples */
 	bool		attempt_freeze;
+	/* whether to attempt setting the VM */
+	bool		attempt_set_vm;
 	struct VacuumCutoffs *cutoffs;
 	Relation	relation;
 
@@ -236,7 +238,8 @@ static void page_verify_redirects(Page page);
 
 static bool heap_page_will_freeze(bool did_tuple_hint_fpi, bool do_prune, bool do_hint_prune,
 								  PruneState *prstate);
-static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason);
+static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason,
+								  bool do_prune, bool do_freeze);
 
 
 /*
@@ -257,7 +260,8 @@ static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason);
  * unpinning *vmbuffer.
  */
 void
-heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
+heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer,
+					bool rel_read_only)
 {
 	Page		page = BufferGetPage(buffer);
 	TransactionId prune_xid;
@@ -339,6 +343,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
 			 * current implementation.
 			 */
 			params.options = HEAP_PAGE_PRUNE_ALLOW_FAST_PATH;
+			if (rel_read_only)
+				params.options |= HEAP_PAGE_PRUNE_SET_VM;
 
 			heap_page_prune_and_freeze(&params, &presult, &dummy_off_loc,
 									   NULL, NULL);
@@ -395,6 +401,7 @@ prune_freeze_setup(PruneFreezeParams *params,
 	/* cutoffs must be provided if we will attempt freezing */
 	Assert(!(params->options & HEAP_PAGE_PRUNE_FREEZE) || params->cutoffs);
 	prstate->attempt_freeze = (params->options & HEAP_PAGE_PRUNE_FREEZE) != 0;
+	prstate->attempt_set_vm = (params->options & HEAP_PAGE_PRUNE_SET_VM) != 0;
 	prstate->cutoffs = params->cutoffs;
 	prstate->relation = params->relation;
 	prstate->block = BufferGetBlockNumber(params->buffer);
@@ -474,9 +481,8 @@ prune_freeze_setup(PruneFreezeParams *params,
 	 * We track whether the page will be all-visible/all-frozen at the end of
 	 * pruning and freezing. While examining tuple visibility, we'll set
 	 * set_all_visible to false if there are tuples on the page not visible to
-	 * all running and future transactions. set_all_visible is always
-	 * maintained but only VACUUM will set the VM if the page ends up being
-	 * all-visible.
+	 * all running and future transactions. If enabled for this scan, we will
+	 * set the VM if the page ends up being all-visible.
 	 *
 	 * We also keep track of the newest live XID, which is used to calculate
 	 * the snapshot conflict horizon for a WAL record setting the VM.
@@ -928,21 +934,37 @@ heap_page_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum,
  * This function does not actually set the VM bits or page-level visibility
  * hint, PD_ALL_VISIBLE.
  *
+ * This should be called only after do_freeze has been decided (and do_prune
+ * has been set), as these factor into our heuristic-based decision.
+ *
  * Returns true if one or both VM bits should be set and false otherwise.
  */
 static bool
-heap_page_will_set_vm(PruneState *prstate, PruneReason reason)
+heap_page_will_set_vm(PruneState *prstate, PruneReason reason,
+					  bool do_prune, bool do_freeze)
 {
-	/*
-	 * Though on-access pruning maintains prstate->set_all_visible, we don't
-	 * set the VM for now.
-	 */
-	if (reason == PRUNE_ON_ACCESS)
+	if (!prstate->attempt_set_vm)
 		return false;
 
 	if (!prstate->set_all_visible)
 		return false;
 
+	/*
+	 * If this is an on-access call and we're not actually pruning, avoid
+	 * setting the visibility map if it would newly dirty the heap page or, if
+	 * the page is already dirty, if doing so would require including a
+	 * full-page image (FPI) of the heap page in the WAL. This situation
+	 * should be rare, as on-access pruning is only attempted when
+	 * pd_prune_xid is valid.
+	 */
+	if (reason == PRUNE_ON_ACCESS && !do_prune && !do_freeze &&
+		(!BufferIsDirty(prstate->buffer) || XLogCheckBufferNeedsBackup(prstate->buffer)))
+	{
+		prstate->set_all_visible = false;
+		prstate->set_all_frozen = false;
+		return false;
+	}
+
 	prstate->new_vmbits = VISIBILITYMAP_ALL_VISIBLE;
 
 	if (prstate->set_all_frozen)
@@ -1168,7 +1190,7 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
 	Assert(!prstate.set_all_frozen || prstate.set_all_visible);
 	Assert(!prstate.set_all_visible || (prstate.lpdead_items == 0));
 
-	do_set_vm = heap_page_will_set_vm(&prstate, params->reason);
+	do_set_vm = heap_page_will_set_vm(&prstate, params->reason, do_prune, do_freeze);
 
 	/*
 	 * new_vmbits should be 0 regardless of whether or not the page is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 461fdf4ed83..37dba4cb3ec 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2033,7 +2033,7 @@ lazy_scan_prune(LVRelState *vacrel,
 		.buffer = buf,
 		.vmbuffer = vmbuffer,
 		.reason = PRUNE_VACUUM_SCAN,
-		.options = HEAP_PAGE_PRUNE_FREEZE,
+		.options = HEAP_PAGE_PRUNE_FREEZE | HEAP_PAGE_PRUNE_SET_VM,
 		.vistest = vacrel->vistest,
 		.cutoffs = &vacrel->cutoffs,
 	};
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e32f28d7acb..78c85536d39 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -43,6 +43,7 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 #define HEAP_PAGE_PRUNE_ALLOW_FAST_PATH		(1 << 2)
+#define HEAP_PAGE_PRUNE_SET_VM				(1 << 3)
 
 typedef struct BulkInsertStateData *BulkInsertState;
 typedef struct GlobalVisState GlobalVisState;
@@ -95,7 +96,12 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
-	/* Current heap block's corresponding page in the visibility map */
+	/*
+	 * For sequential scans, bitmap heap scans, TID range scans, and sample
+	 * scans. The current heap block's corresponding page in the visibility
+	 * map. If the relation is not modified by the query, on-access pruning
+	 * may set the VM.
+	 */
 	Buffer		rs_vmbuffer;
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
@@ -126,7 +132,11 @@ typedef struct IndexFetchHeapData
 	 */
 	Buffer		xs_cbuf;
 
-	/* Current heap block's corresponding page in the visibility map */
+	/*
+	 * Current heap block's corresponding page in the visibility map. For
+	 * index scans that do not modify the underlying heap table, on-access
+	 * pruning may set the VM on-access.
+	 */
 	Buffer		xs_vmbuffer;
 } IndexFetchHeapData;
 
@@ -431,7 +441,7 @@ extern TransactionId heap_index_delete_tuples(Relation rel,
 
 /* in heap/pruneheap.c */
 extern void heap_page_prune_opt(Relation relation, Buffer buffer,
-								Buffer *vmbuffer);
+								Buffer *vmbuffer, bool rel_read_only);
 extern void heap_page_prune_and_freeze(PruneFreezeParams *params,
 									   PruneFreezeResult *presult,
 									   OffsetNumber *off_loc,
-- 
2.43.0



  [text/x-patch] v41-0012-Set-pd_prune_xid-on-insert.patch (8.8K, 13-v41-0012-Set-pd_prune_xid-on-insert.patch)
  download | inline diff:
From 99c5b17c45e413680919f2623fc528bb35873dc9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 29 Jul 2025 16:12:56 -0400
Subject: [PATCH v41 12/12] Set pd_prune_xid on insert
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Now that visibility map (VM) updates can occur during read-only queries,
it makes sense to also set the page’s pd_prune_xid hint during inserts
and on the new page during updates.

This enables heap_page_prune_and_freeze() to set the VM all-visible
after a page is filled with newly inserted tuples the first time it is
read. This means the page will get set all-visible when it is still in
shared buffers and avoid potential I/O amplification when vacuum later
has to scan the page and set it all-visible. It also enables index-only
scans of newly inserted data much sooner.

This change also addresses a long-standing note in heap_insert() and
heap_multi_insert(), which observed that setting pd_prune_xid would
help clean up aborted insertions sooner. Without it, such tuples might
linger until VACUUM, whereas now they can be pruned earlier.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
---
 src/backend/access/heap/heapam.c      | 40 +++++++++++++++++----------
 src/backend/access/heap/heapam_xlog.c | 19 ++++++++++++-
 src/backend/access/heap/pruneheap.c   | 18 ++++++------
 3 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dbdf6521c42..ba11bbc03a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2156,6 +2156,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple	heaptup;
 	Buffer		buffer;
+	Page		page;
 	Buffer		vmbuffer = InvalidBuffer;
 	bool		all_visible_cleared = false;
 
@@ -2182,6 +2183,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 									   &vmbuffer, NULL,
 									   0);
 
+	page = BufferGetPage(buffer);
+
 	/*
 	 * We're about to do the actual insert -- but check for conflict first, to
 	 * avoid possibly having to roll back work we've just done.
@@ -2205,25 +2208,30 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	RelationPutHeapTuple(relation, buffer, heaptup,
 						 (options & HEAP_INSERT_SPECULATIVE) != 0);
 
-	if (PageIsAllVisible(BufferGetPage(buffer)))
+	if (PageIsAllVisible(page))
 	{
 		all_visible_cleared = true;
-		PageClearAllVisible(BufferGetPage(buffer));
+		PageClearAllVisible(page);
 		visibilitymap_clear(relation,
 							ItemPointerGetBlockNumber(&(heaptup->t_self)),
 							vmbuffer, VISIBILITYMAP_VALID_BITS);
 	}
 
 	/*
-	 * XXX Should we set PageSetPrunable on this page ?
+	 * Set pd_prune_xid to trigger heap_page_prune_and_freeze() once the page
+	 * is full so that we can set the page all-visible in the VM on the next
+	 * page access.
 	 *
-	 * The inserting transaction may eventually abort thus making this tuple
-	 * DEAD and hence available for pruning. Though we don't want to optimize
-	 * for aborts, if no other tuple in this page is UPDATEd/DELETEd, the
-	 * aborted tuple will never be pruned until next vacuum is triggered.
+	 * Setting pd_prune_xid is also handy if the inserting transaction
+	 * eventually aborts making this tuple DEAD and hence available for
+	 * pruning. If no other tuple in this page is UPDATEd/DELETEd, the aborted
+	 * tuple would never otherwise be pruned until next vacuum is triggered.
 	 *
-	 * If you do add PageSetPrunable here, add it in heap_xlog_insert too.
+	 * Don't set it if we are in bootstrap mode or we are inserting a frozen
+	 * tuple, as there is no further pruning/freezing needed in those cases.
 	 */
+	if (TransactionIdIsNormal(xid) && !(options & HEAP_INSERT_FROZEN))
+		PageSetPrunable(page, xid);
 
 	MarkBufferDirty(buffer);
 
@@ -2233,7 +2241,6 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
 		XLogRecPtr	recptr;
-		Page		page = BufferGetPage(buffer);
 		uint8		info = XLOG_HEAP_INSERT;
 		int			bufflags = 0;
 
@@ -2598,8 +2605,13 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		}
 
 		/*
-		 * XXX Should we set PageSetPrunable on this page ? See heap_insert()
+		 * Set pd_prune_xid. See heap_insert() for more on why we do this when
+		 * inserting tuples. This only makes sense if we aren't already
+		 * setting the page frozen in the VM. We also don't set it in
+		 * bootstrap mode.
 		 */
+		if (!all_frozen_set && TransactionIdIsNormal(xid))
+			PageSetPrunable(page, xid);
 
 		MarkBufferDirty(buffer);
 
@@ -4141,12 +4153,12 @@ l2:
 	 * the subsequent page pruning will be a no-op and the hint will be
 	 * cleared.
 	 *
-	 * XXX Should we set hint on newbuf as well?  If the transaction aborts,
-	 * there would be a prunable tuple in the newbuf; but for now we choose
-	 * not to optimize for aborts.  Note that heap_xlog_update must be kept in
-	 * sync if this decision changes.
+	 * We set the new page prunable as well. See heap_insert() for more on why
+	 * we do this when inserting tuples.
 	 */
 	PageSetPrunable(page, xid);
+	if (newbuf != buffer)
+		PageSetPrunable(newpage, xid);
 
 	if (use_hot_update)
 	{
diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
index 1302bb13e18..f3f419d3dc1 100644
--- a/src/backend/access/heap/heapam_xlog.c
+++ b/src/backend/access/heap/heapam_xlog.c
@@ -450,6 +450,14 @@ heap_xlog_insert(XLogReaderState *record)
 
 		freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
 
+		/*
+		 * Set the page prunable to trigger on-access pruning later, which may
+		 * set the page all-visible in the VM. See comments in heap_insert().
+		 */
+		if (TransactionIdIsNormal(XLogRecGetXid(record)) &&
+			!HeapTupleHeaderXminFrozen(htup))
+			PageSetPrunable(page, XLogRecGetXid(record));
+
 		PageSetLSN(page, lsn);
 
 		if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
@@ -599,12 +607,19 @@ heap_xlog_multi_insert(XLogReaderState *record)
 		if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
 			PageClearAllVisible(page);
 
-		/* XLH_INSERT_ALL_FROZEN_SET implies that all tuples are visible */
+		/*
+		 * XLH_INSERT_ALL_FROZEN_SET implies that all tuples are visible. If
+		 * we are not setting the page frozen, then set the page's prunable
+		 * hint so that we trigger on-access pruning later which may set the
+		 * page all-visible in the VM.
+		 */
 		if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET)
 		{
 			PageSetAllVisible(page);
 			PageClearPrunable(page);
 		}
+		else
+			PageSetPrunable(page, XLogRecGetXid(record));
 
 		MarkBufferDirty(buffer);
 	}
@@ -921,6 +936,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
 		freespace = PageGetHeapFreeSpace(npage);
 
 		PageSetLSN(npage, lsn);
+		/* See heap_insert() for why we set pd_prune_xid on insert */
+		PageSetPrunable(npage, XLogRecGetXid(record));
 		MarkBufferDirty(nbuffer);
 	}
 
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index deb7b948c1e..9f8c83aa7d3 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -279,7 +279,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer,
 	/*
 	 * First check whether there's any chance there's something to prune,
 	 * determining the appropriate horizon is a waste if there's no prune_xid
-	 * (i.e. no updates/deletes left potentially dead tuples around).
+	 * (i.e. no updates/deletes left potentially dead tuples around and no
+	 * inserts inserted new tuples that may be visible to all).
 	 */
 	prune_xid = PageGetPruneXid(page);
 	if (!TransactionIdIsValid(prune_xid))
@@ -1918,17 +1919,14 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
 			prstate->set_all_visible = false;
 			prstate->set_all_frozen = false;
 
-			/* The page should not be marked all-visible */
-			if (PageIsAllVisible(page))
-				heap_page_fix_vm_corruption(prstate, offnum,
-											VM_CORRUPT_TUPLE_VISIBILITY);
-
 			/*
-			 * If we wanted to optimize for aborts, we might consider marking
-			 * the page prunable when we see INSERT_IN_PROGRESS.  But we
-			 * don't.  See related decisions about when to mark the page
-			 * prunable in heapam.c.
+			 * Though there is nothing "prunable" on the page, we maintain
+			 * pd_prune_xid for inserts so that we have the opportunity to
+			 * mark them all-visible during the next round of pruning.
 			 */
+			heap_prune_record_prunable(prstate,
+									   HeapTupleHeaderGetXmin(htup),
+									   offnum);
 			break;
 
 		case HEAPTUPLE_DELETE_IN_PROGRESS:
-- 
2.43.0



view thread (144+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
  In-Reply-To: <CAAKRu_bsHbvt+VqbjHXsdphKf8fqwBEutRhH3fmo+qUVe=yBKw@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox