Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

public inbox for [email protected]  
help / color / mirror / Atom feed

Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
26+ messages / 4 participants
[nested] [flat]

* Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2023-12-15 19:07  Michail Nikolaev <[email protected]>
  0 siblings, 2 replies; 26+ messages in thread

From: Michail Nikolaev @ 2023-12-15 19:07 UTC (permalink / raw)
  To: pgsql-hackers; Alvaro Herrera <[email protected]>

Hello, hackers!

I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
improvements) in some lighter way.

Yes, a serious bug was (2) caused by this optimization and now it reverted.

But what about a more safe idea in that direction:
1) add new horizon which ignores PROC_IN_SAFE_IC backends and standbys queries
2) use this horizon for settings LP_DEAD bit in indexes (excluding
indexes being built of course)

Index LP_DEAD hints are not used by standby in any way (they are just
ignored), also heap scan done by index building does not use them as
well.

But, at the same time:
1) index scans will be much faster during index creation or standby
reporting queries
2) indexes can keep them fit using different optimizations
3) less WAL due to a huge amount of full pages writes (which caused by
tons of LP_DEAD in indexes)

The patch seems more-less easy to implement.
Does it worth being implemented? Or to scary?

[1]: https://postgr.es/m/[email protected]
[2]: https://postgr.es/m/17485-396609c6925b982d%40postgresql.org

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2023-12-15 21:11  Matthias van de Meent <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  1 sibling, 1 reply; 26+ messages in thread

From: Matthias van de Meent @ 2023-12-15 21:11 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

On Fri, 15 Dec 2023, 20:07 Michail Nikolaev, <[email protected]>
wrote:

> Hello, hackers!
>
> I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
> improvements) in some lighter way.
>
> Yes, a serious bug was (2) caused by this optimization and now it reverted.
>
> But what about a more safe idea in that direction:
> 1) add new horizon which ignores PROC_IN_SAFE_IC backends and standbys
> queries
> 2) use this horizon for settings LP_DEAD bit in indexes (excluding
> indexes being built of course)
>
> Index LP_DEAD hints are not used by standby in any way (they are just
> ignored), also heap scan done by index building does not use them as
> well.
>
> But, at the same time:
> 1) index scans will be much faster during index creation or standby
> reporting queries
> 2) indexes can keep them fit using different optimizations
> 3) less WAL due to a huge amount of full pages writes (which caused by
> tons of LP_DEAD in indexes)
>
> The patch seems more-less easy to implement.
> Does it worth being implemented? Or to scary?
>

I hihgly doubt this is worth the additional cognitive overhead of another
liveness state, and I think there might be other issues with marking index
tuples dead in indexes before the table tuple is dead that I can't think of
right now.

I've thought about alternative solutions, too: how about getting a new
snapshot every so often?
We don't really care about the liveness of the already-scanned data; the
snapshots used for RIC are used only during the scan. C/RIC's relation's
lock level means vacuum can't run to clean up dead line items, so as long
as we only swap the backend's reported snapshot (thus xmin) while the scan
is between pages we should be able to reduce the time C/RIC is the one
backend holding back cleanup of old tuples.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

> [1]: https://postgr.es/m/[email protected]
> [2]: https://postgr.es/m/17485-396609c6925b982d%40postgresql.org
>
>
>

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2023-12-17 20:14  Michail Nikolaev <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Michail Nikolaev @ 2023-12-17 20:14 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Hello!

> I've thought about alternative solutions, too: how about getting a new snapshot every so often?
> We don't really care about the liveness of the already-scanned data; the snapshots used for RIC
> are used only during the scan. C/RIC's relation's lock level means vacuum can't run to clean up
> dead line items, so as long as we only swap the backend's reported snapshot (thus xmin) while
> the scan is between pages we should be able to reduce the time C/RIC is the one backend
> holding back cleanup of old tuples.

Hm, it looks like an interesting idea! It may be more dangerous, but
at least it feels much more elegant than an LP_DEAD-related way.
Also, feels like we may apply this to both phases (first and the second scans).
The original patch (1) was helping only to the second one (after call
to set_indexsafe_procflags).

But for the first scan we allowed to do so only for non-unique indexes
because of:

> * The reason for doing that is to avoid
> * bogus unique-index failures due to concurrent UPDATEs (we might see
> * different versions of the same row as being valid when we pass over them,
> * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
> * does not contain any tuples added to the table while we built the index.

Also, (1) was limited to indexes without expressions and predicates
(2) because such may execute queries to other tables (sic!).
One possible solution is to add some checks to make sure no
user-defined functions are used.
But as far as I understand, it affects only CIC for now and does not
affect the ability to use the proposed technique (updating snapshot
time to time).

However, I think we need some more-less formal proof it is safe - it
is really challenging to keep all the possible cases in the head. I’ll
try to do something here.
Another possible issue may be caused by the new locking pattern - we
will be required to wait for all transaction started before the ending
of the phase to exit.

[1]: https://postgr.es/m/[email protected]
[2]: https://www.postgresql.org/message-id/flat/CAAaqYe_tq_Mtd9tdeGDsgQh%2BwMvouithAmcOXvCbLaH2PPGHvA%40m...

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2023-12-17 23:53  Matthias van de Meent <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Matthias van de Meent @ 2023-12-17 23:53 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

On Sun, 17 Dec 2023, 21:14 Michail Nikolaev, <[email protected]> wrote:
>
> Hello!
>
> > I've thought about alternative solutions, too: how about getting a new snapshot every so often?
> > We don't really care about the liveness of the already-scanned data; the snapshots used for RIC
> > are used only during the scan. C/RIC's relation's lock level means vacuum can't run to clean up
> > dead line items, so as long as we only swap the backend's reported snapshot (thus xmin) while
> > the scan is between pages we should be able to reduce the time C/RIC is the one backend
> > holding back cleanup of old tuples.
>
> Hm, it looks like an interesting idea! It may be more dangerous, but
> at least it feels much more elegant than an LP_DEAD-related way.
> Also, feels like we may apply this to both phases (first and the second scans).
> The original patch (1) was helping only to the second one (after call
> to set_indexsafe_procflags).
>
> But for the first scan we allowed to do so only for non-unique indexes
> because of:
>
> > * The reason for doing that is to avoid
> > * bogus unique-index failures due to concurrent UPDATEs (we might see
> > * different versions of the same row as being valid when we pass over them,
> > * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
> > * does not contain any tuples added to the table while we built the index.

Yes, for that we'd need an extra scan of the index that validates
uniqueness. I think there was a proposal (though it may only have been
an idea) some time ago, about turning existing non-unique indexes into
unique ones by validating the data. Such a system would likely be very
useful to enable this optimization.

> Also, (1) was limited to indexes without expressions and predicates
> (2) because such may execute queries to other tables (sic!).

Note that the use of such expressions would be a violation of the
function's definition; it would depend on data from other tables which
makes the function behave like a STABLE function, as opposed to the
IMMUTABLE that is required for index expressions. So, I don't think we
should specially care about being correct for incorrectly marked
function definitions.

> One possible solution is to add some checks to make sure no
> user-defined functions are used.
> But as far as I understand, it affects only CIC for now and does not
> affect the ability to use the proposed technique (updating snapshot
> time to time).
>
> However, I think we need some more-less formal proof it is safe - it
> is really challenging to keep all the possible cases in the head. I’ll
> try to do something here.

I just realised there is one issue with this design: We can't cheaply
reset the snapshot during the second table scan:
It is critically important that the second scan of R/CIC uses an index
contents summary (made with index_bulk_delete) that was created while
the current snapshot was already registered. If we didn't do that, the
following would occur:

1. The index is marked ready for inserts from concurrent backends, but
not yet ready for queries.
2. We get the bulkdelete data
3. A concurrent backend inserts a new tuple T on heap page P, inserts
it into the index, and commits. This tuple is not in the summary, but
has been inserted into the index.
4. R/CIC resets the snapshot, making T visible.
5. R/CIC scans page P, finds that tuple T has to be indexed but is not
present in the summary, thus inserts that tuple into the index (which
already had it inserted at 3)

This thus would be a logic bug, as indexes assume at-most-once
semantics for index tuple insertion; duplicate insertion are an error.

So, the "reset the snapshot every so often" trick cannot be applied in
phase 3 (the rescan), or we'd have to do an index_bulk_delete call
every time we reset the snapshot. Rescanning might be worth the cost
(e.g. when using BRIN), but that is very unlikely.

Alternatively, we'd need to find another way to prevent us from
inserting these duplicate entries - maybe by storing the scan's data
in a buffer to later load into the index after another
index_bulk_delete()? Counterpoint: for BRIN indexes that'd likely
require a buffer much larger than the result index would be.

Either way, for the first scan (i.e. phase 2 "build new indexes") this
is not an issue: we don't care about what transaction adds/deletes
tuples at that point.
For all we know, all tuples of the table may be deleted concurrently
before we even allow concurrent backends to start inserting tuples,
and the algorithm would still work as it does right now.

> Another possible issue may be caused by the new locking pattern - we
> will be required to wait for all transaction started before the ending
> of the phase to exit.

What do you mean by "new locking pattern"? We already keep an
ShareUpdateExclusiveLock on every heap table we're accessing during
R/CIC, and that should already prevent any concurrent VACUUM
operations, right?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2023-12-20 09:56  Michail Nikolaev <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Michail Nikolaev @ 2023-12-20 09:56 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Hello!

> Also, feels like we may apply this to both phases (first and the second scans).
> The original patch (1) was helping only to the second one (after call
> to set_indexsafe_procflags).

Oops, I was wrong here. The original version of the patch was also applied to
both phases.

> Note that the use of such expressions would be a violation of the
> function's definition; it would depend on data from other tables which
> makes the function behave like a STABLE function, as opposed to the
> IMMUTABLE that is required for index expressions. So, I don't think we
> should specially care about being correct for incorrectly marked
> function definitions.

Yes, but such cases could probably cause crashes also...
So, I think it is better to check them for custom functions. But I
still not sure -
if such limitations still required for proposed optimization or not.

> I just realised there is one issue with this design: We can't cheaply
> reset the snapshot during the second table scan:
> It is critically important that the second scan of R/CIC uses an index
> contents summary (made with index_bulk_delete) that was created while
> the current snapshot was already registered.

> So, the "reset the snapshot every so often" trick cannot be applied in
> phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> every time we reset the snapshot. Rescanning might be worth the cost
> (e.g. when using BRIN), but that is very unlikely.

Hm, I think it is still possible. We could just manually recheck the
tuples we see
to the snapshot currently used for the scan. If an "old" snapshot can see
the tuple also (HeapTupleSatisfiesHistoricMVCC) then search for it in the
index summary.

> What do you mean by "new locking pattern"? We already keep an
> ShareUpdateExclusiveLock on every heap table we're accessing during
> R/CIC, and that should already prevent any concurrent VACUUM
> operations, right?

I was thinking not about "classical" locking, but about waiting for
other backends
by WaitForLockers(heaplocktag, ShareLock, true). But I think
everything should be
fine.

Best regards,
Michail.






^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2023-12-20 11:14  Matthias van de Meent <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Matthias van de Meent @ 2023-12-20 11:14 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

On Wed, 20 Dec 2023 at 10:56, Michail Nikolaev
<[email protected]> wrote:
> > Note that the use of such expressions would be a violation of the
> > function's definition; it would depend on data from other tables which
> > makes the function behave like a STABLE function, as opposed to the
> > IMMUTABLE that is required for index expressions. So, I don't think we
> > should specially care about being correct for incorrectly marked
> > function definitions.
>
> Yes, but such cases could probably cause crashes also...
> So, I think it is better to check them for custom functions. But I
> still not sure -
> if such limitations still required for proposed optimization or not.

I think contents could be inconsistent, but not more inconsistent than
if the index was filled across multiple transactions using inserts.
Either way I don't see it breaking more things that are not already
broken in that way in other places - at most it will introduce another
path that exposes the broken state caused by mislabeled functions.

> > I just realised there is one issue with this design: We can't cheaply
> > reset the snapshot during the second table scan:
> > It is critically important that the second scan of R/CIC uses an index
> > contents summary (made with index_bulk_delete) that was created while
> > the current snapshot was already registered.
>
> > So, the "reset the snapshot every so often" trick cannot be applied in
> > phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> > every time we reset the snapshot. Rescanning might be worth the cost
> > (e.g. when using BRIN), but that is very unlikely.
>
> Hm, I think it is still possible. We could just manually recheck the
> tuples we see
> to the snapshot currently used for the scan. If an "old" snapshot can see
> the tuple also (HeapTupleSatisfiesHistoricMVCC) then search for it in the
> index summary.
That's an interesting method.

How would this deal with tuples not visible to the old snapshot?
Presumably we can assume they're newer than that snapshot (the old
snapshot didn't have it, but the new one does, so it's committed after
the old snapshot, making them newer), so that backend must have
inserted it into the index already, right?

> HeapTupleSatisfiesHistoricMVCC

That function has this comment marker:
   "Only usable on tuples from catalog tables!"
Is that correct even for this?

Should this deal with any potential XID wraparound, too?
How does this behave when the newly inserted tuple's xmin gets frozen?
This would be allowed to happen during heap page pruning, afaik - no
rules that I know of which are against that - but it would create
issues where normal snapshot visibility rules would indicate it
visible to both snapshots regardless of whether it actually was
visible to the older snapshot when that snapshot was created...

Either way, "Historic snapshot" isn't something I've worked with
before, so that goes onto my "figure out how it works" pile.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2023-12-20 16:18  Michail Nikolaev <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Michail Nikolaev @ 2023-12-20 16:18 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Hello!

> How would this deal with tuples not visible to the old snapshot?
> Presumably we can assume they're newer than that snapshot (the old
> snapshot didn't have it, but the new one does, so it's committed after
> the old snapshot, making them newer), so that backend must have
> inserted it into the index already, right?

Yes, exactly.

>> HeapTupleSatisfiesHistoricMVCC
> That function has this comment marker:
> "Only usable on tuples from catalog tables!"
> Is that correct even for this?

Yeah, we just need HeapTupleSatisfiesVisibility (which calls
HeapTupleSatisfiesMVCC) instead.

> Should this deal with any potential XID wraparound, too?

Yeah, looks like we should care about such case somehow.

Possible options here:

1) Skip vac_truncate_clog while CIC is running. In fact, I think it's
not that much worse than the current state - datfrozenxid is still
updated in the catalog and will be considered the next time
vac_update_datfrozenxid is called (the next VACCUM on any table).

2) Delay vac_truncate_clog while CIC is running.
In such a case, if it was skipped, we will need to re-run it using the
index builds backend later.

3) Wait for 64-bit xids :)

4) Any ideas?

In addition, for the first and second options, we need logic to cancel
the second phase in the case of ForceTransactionIdLimitUpdate.
But maybe I'm missing something and the tuples may be frozen, ignoring
the set datfrozenxid values (over some horizon calculated at runtime
based on the xmin backends).

> How does this behave when the newly inserted tuple's xmin gets frozen?
> This would be allowed to happen during heap page pruning, afaik - no
> rules that I know of which are against that - but it would create
> issues where normal snapshot visibility rules would indicate it
> visible to both snapshots regardless of whether it actually was
> visible to the older snapshot when that snapshot was created...

Yes, good catch.
Assuming we have somehow prevented vac_truncate_clog from occurring
during CIC, we can leave frozen and potentially frozen
(xmin<frozenXID) for the second phase.

So, first phase processing items:
* not frozen
* xmin>frozenXID (may not be frozen)
* visible by snapshot

second phase:
* frozen
* xmin>frozenXID (may be frozen)
* not in the index summary
* visible by "old" snapshot

You might also think – why is the first stage needed at all? Just use
batch processing during initial index building?

Best regards,
Mikhail.

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2023-12-20 16:53  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Michail Nikolaev @ 2023-12-20 16:53 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

> Yes, good catch.
> Assuming we have somehow prevented vac_truncate_clog from occurring
> during CIC, we can leave frozen and potentially frozen
> (xmin<frozenXID) for the second phase.

Just realized that we can leave this for the first stage to improve efficiency.
Since the ID is locked, anything that can be frozen will be visible in
the first stage.






^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2023-12-21 13:14  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Michail Nikolaev @ 2023-12-21 13:14 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Hello.

Realized my last idea is invalid (because tuples are frozen by using
dynamically calculated horizon) - so, don't waste your time on it :)

Need to think a little bit more here.

Thanks,
Mikhail.






^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2023-12-25 14:12  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Michail Nikolaev @ 2023-12-25 14:12 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Hello!

It seems like the idea of "old" snapshot is still a valid one.

> Should this deal with any potential XID wraparound, too?

As far as I understand in our case, we are not affected by this in any way.
Vacuum in our table is not possible because of locking, so, nothing
may be frozen (see below).
In the case of super long index building, transactional limits will
stop new connections using current
regular infrastructure because it is based on relation data (but not
actual xmin of backends).

> How does this behave when the newly inserted tuple's xmin gets frozen?
> This would be allowed to happen during heap page pruning, afaik - no
> rules that I know of which are against that - but it would create
> issues where normal snapshot visibility rules would indicate it
> visible to both snapshots regardless of whether it actually was
> visible to the older snapshot when that snapshot was created...

As I can see, heap_page_prune never freezes any tuples.
In the case of regular vacuum, it used this way: call heap_page_prune
and then call heap_prepare_freeze_tuple and then
heap_freeze_execute_prepared.

Merry Christmas,
Mikhail.

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-01-04 11:24  Matthias van de Meent <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Matthias van de Meent @ 2024-01-04 11:24 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

On Mon, 25 Dec 2023 at 15:12, Michail Nikolaev
<[email protected]> wrote:
>
> Hello!
>
> It seems like the idea of "old" snapshot is still a valid one.
>
> > Should this deal with any potential XID wraparound, too?
>
> As far as I understand in our case, we are not affected by this in any way.
> Vacuum in our table is not possible because of locking, so, nothing
> may be frozen (see below).
> In the case of super long index building, transactional limits will
> stop new connections using current
> regular infrastructure because it is based on relation data (but not
> actual xmin of backends).
>
> > How does this behave when the newly inserted tuple's xmin gets frozen?
> > This would be allowed to happen during heap page pruning, afaik - no
> > rules that I know of which are against that - but it would create
> > issues where normal snapshot visibility rules would indicate it
> > visible to both snapshots regardless of whether it actually was
> > visible to the older snapshot when that snapshot was created...
>
> As I can see, heap_page_prune never freezes any tuples.
> In the case of regular vacuum, it used this way: call heap_page_prune
> and then call heap_prepare_freeze_tuple and then
> heap_freeze_execute_prepared.

Correct, but there are changes being discussed where we would freeze
tuples during pruning as well [0], which would invalidate that
implementation detail. And, if I had to choose between improved
opportunistic freezing and improved R/CIC, I'd probably choose
improved freezing over R/CIC.

As an alternative, we _could_ keep track of concurrent index inserts
using a dummy index (with the same predicate) which only holds the
TIDs of the inserted tuples. We'd keep it as an empty index in phase
1, and every time we reset the visibility snapshot we now only need to
scan that index to know what tuples were concurrently inserted. This
should have a significantly lower IO overhead than repeated full index
bulkdelete scans for the new index in the second table scan phase of
R/CIC. However, in a worst case it could still require another
O(tablesize) of storage.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.postgresql.org/message-id/[email protected]...

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-01-04 12:45  Michail Nikolaev <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Michail Nikolaev @ 2024-01-04 12:45 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Hello!

> Correct, but there are changes being discussed where we would freeze
> tuples during pruning as well [0], which would invalidate that
> implementation detail. And, if I had to choose between improved
> opportunistic freezing and improved R/CIC, I'd probably choose
> improved freezing over R/CIC.

As another option, we could extract a dedicated horizon value for an
opportunistic freezing.
And use some flags in R/CIC backend to keep it at the required value.

Best regards,
Michail.






^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-01-09 17:00  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Michail Nikolaev @ 2024-01-09 17:00 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>; Matthias van de Meent <[email protected]>

Hello, Melanie!

Sorry to interrupt you, just a quick question.

> Correct, but there are changes being discussed where we would freeze
> tuples during pruning as well [0], which would invalidate that
> implementation detail. And, if I had to choose between improved
> opportunistic freezing and improved R/CIC, I'd probably choose
> improved freezing over R/CIC.

Do you have any patches\threads related to that refactoring
(opportunistic freezing of tuples during pruning) [0]?
This may affect the idea of the current thread (latest version of it
mostly in [1]) - it may be required to disable such a feature for
particular relation temporary or affect horizon used for pruning
(without holding xmin).

Just no sure - is it reasonable to start coding right now, or wait for
some prune-freeze-related patch first?

[0] https://www.postgresql.org/message-id/[email protected]...
[1] https://www.postgresql.org/message-id/flat/CANtu0ojRX%3DosoiXL9JJG6g6qOowXVbVYX%2BmDsN%2B2jmFVe%3DeG...

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-02-01 16:06  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Michail Nikolaev @ 2024-02-01 16:06 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>; Matthias van de Meent <[email protected]>

> > > I just realised there is one issue with this design: We can't cheaply
> > > reset the snapshot during the second table scan:
> > > It is critically important that the second scan of R/CIC uses an index
> > > contents summary (made with index_bulk_delete) that was created while
> > > the current snapshot was already registered.
> >
> > > So, the "reset the snapshot every so often" trick cannot be applied in
> > > phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> > > every time we reset the snapshot. Rescanning might be worth the cost
> > > (e.g. when using BRIN), but that is very unlikely.
> >
> > Hm, I think it is still possible. We could just manually recheck the
> > tuples we see
> > to the snapshot currently used for the scan. If an "old" snapshot can see
> > the tuple also (HeapTupleSatisfiesHistoricMVCC) then search for it in the
> > index summary.
> That's an interesting method.
>
> How would this deal with tuples not visible to the old snapshot?
> Presumably we can assume they're newer than that snapshot (the old
> snapshot didn't have it, but the new one does, so it's committed after
> the old snapshot, making them newer), so that backend must have
> inserted it into the index already, right?

I made a draft of the patch and this idea is not working.

The problem is generally the same:

* reference snapshot sees tuple X
* reference snapshot is used to create index summary (but there is no
tuple X in the index summary)
* tuple X is updated to Y creating a HOT-chain
* we started scan with new temporary snapshot (it sees Y, X is too old for it)
* tuple X is pruned from HOT-chain because it is not protected by any snapshot
* we see tuple Y in the scan with temporary snapshot
    * it is not in the index summary - so, we need to check if
reference snapshot can see it
    * there is no way to understand if the reference snapshot was able
to see tuple X - because we need the full HOT chain (with X tuple) for
that

Best regards,
Michail.






^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-02-17 21:48  Matthias van de Meent <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Matthias van de Meent @ 2024-02-17 21:48 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

On Thu, 1 Feb 2024, 17:06 Michail Nikolaev, <[email protected]> wrote:
>
> > > > I just realised there is one issue with this design: We can't cheaply
> > > > reset the snapshot during the second table scan:
> > > > It is critically important that the second scan of R/CIC uses an index
> > > > contents summary (made with index_bulk_delete) that was created while
> > > > the current snapshot was already registered.

I think the best way for this to work would be an index method that
exclusively stores TIDs, and of which we can quickly determine new
tuples, too. I was thinking about something like GIN's format, but
using (generation number, tid) instead of ([colno, colvalue], tid) as
key data for the internal trees, and would be unlogged (because the
data wouldn't have to survive a crash). Then we could do something
like this for the second table scan phase:

0. index->indisready is set
[...]
1. Empty the "changelog index", resetting storage and the generation number.
2. Take index contents snapshot of new index, store this.
3. Loop until completion:
4a. Take visibility snapshot
4b. Update generation number of the changelog index, store this.
4c. Take index snapshot of "changelog index" for data up to the
current stored generation number. Not including, because we only need
to scan that part of the index that were added before we created our
visibility snapshot, i.e. TIDs labeled with generation numbers between
the previous iteration's generation number (incl.) and this
iteration's generation (excl.).
4d. Combine the current index snapshot with that of the "changelog"
index, and save this.
    Note that this needs to take care to remove duplicates.
4e. Scan segment of table (using the combined index snapshot) until we
need to update our visibility snapshot or have scanned the whole
table.

This should give similar, if not the same, behavour as that which we
have when we RIC a table with several small indexes, without requiring
us to scan a full index of data several times.

Attemp on proving this approach's correctness:
In phase 3, after each step 4b:
All matching tuples of the table that are in the visibility snapshot:
* Were created before scan 1's snapshot, thus in the new index's snapshot, or
* Were created after scan 1's snapshot but before index->indisready,
thus not in the new index's snapshot, nor in the changelog index, or
* Were created after the index was set as indisready, and committed
before the previous iteration's visibility snapshot, thus in the
combined index snapshot, or
* Were created after the index was set as indisready, after the
previous visibility snapshot was taken, but before the current
visibility snapshot was taken, and thus definitely included in the
changelog index.

Because we hold a snapshot, no data in the table that we should see is
removed, so we don't have a chance of broken HOT chains.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-02-20 23:33  Michail Nikolaev <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 2 replies; 26+ messages in thread

From: Michail Nikolaev @ 2024-02-20 23:33 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

Hello!

> I think the best way for this to work would be an index method that
> exclusively stores TIDs, and of which we can quickly determine new
> tuples, too. I was thinking about something like GIN's format, but
> using (generation number, tid) instead of ([colno, colvalue], tid) as
> key data for the internal trees, and would be unlogged (because the
> data wouldn't have to survive a crash)

Yeah, this seems to be a reasonable approach, but there are some
doubts related to it - it needs new index type as well as unlogged
indexes to be introduced - this may make the patch too invasive to be
merged. Also, some way to remove the index from the catalog in case of
a crash may be required.

A few more thoughts:
* it is possible to go without generation number - we may provide a
way to do some kind of fast index lookup (by TID) directly during the
second table scan phase.
* one more option is to maintain a Tuplesorts (instead of an index)
with TIDs as changelog and merge with index snapshot after taking a
new visibility snapshot. But it is not clear how to share the same
Tuplesort with multiple inserting backends.
* crazy idea - what is about to do the scan in the index we are
building? We have tuple, so, we have all the data indexed in the
index. We may try to do an index scan using that data to get all
tuples and find the one with our TID :) Yes, in some cases it may be
too bad because of the huge amount of TIDs we need to scan + also
btree copies whole page despite we need single item. But some
additional index method may help - feels like something related to
uniqueness (but it is only in btree anyway).

Thanks,
Mikhail.

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-02-21 08:35  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  1 sibling, 1 reply; 26+ messages in thread

From: Michail Nikolaev @ 2024-02-21 08:35 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

One more idea - is just forbid HOT prune while the second phase is
running. It is not possible anyway currently because of snapshot held.

Possible enhancements:
* we may apply restriction only to particular tables
* we may apply restrictions only to part of the tables (not yet
scanned by R/CICs).

Yes, it is not an elegant solution, limited, not reliable in terms of
architecture, but a simple one.

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-02-21 11:01  Matthias van de Meent <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  1 sibling, 0 replies; 26+ messages in thread

From: Matthias van de Meent @ 2024-02-21 11:01 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

On Wed, 21 Feb 2024 at 00:33, Michail Nikolaev
<[email protected]> wrote:
>
> Hello!
> > I think the best way for this to work would be an index method that
> > exclusively stores TIDs, and of which we can quickly determine new
> > tuples, too. I was thinking about something like GIN's format, but
> > using (generation number, tid) instead of ([colno, colvalue], tid) as
> > key data for the internal trees, and would be unlogged (because the
> > data wouldn't have to survive a crash)
>
> Yeah, this seems to be a reasonable approach, but there are some
> doubts related to it - it needs new index type as well as unlogged
> indexes to be introduced - this may make the patch too invasive to be
> merged.

I suppose so, though persistence is usually just to keep things
correct in case of crashes, and this "index" is only there to support
processes that don't expect to survive crashes.

> Also, some way to remove the index from the catalog in case of
> a crash may be required.

That's less of an issue though, we already accept that a crash during
CIC/RIC leaves unusable indexes around, so "needs more cleanup" is not
exactly a blocker.

> A few more thoughts:
> * it is possible to go without generation number - we may provide a
> way to do some kind of fast index lookup (by TID) directly during the
> second table scan phase.

While possible, I don't think this would be more performant than the
combination approach, at the cost of potentially much more random IO
when the table is aggressively being updated.

> * one more option is to maintain a Tuplesorts (instead of an index)
> with TIDs as changelog and merge with index snapshot after taking a
> new visibility snapshot. But it is not clear how to share the same
> Tuplesort with multiple inserting backends.

Tuplesort requires the leader process to wait for concurrent backends
to finish their sort before it can start consuming their runs. This
would make it a very bad alternative to the "changelog index" as the
CIC process would require on-demand actions from concurrent backends
(flush of sort state). I'm not convinced that's somehow easier.

> * crazy idea - what is about to do the scan in the index we are
> building? We have tuple, so, we have all the data indexed in the
> index. We may try to do an index scan using that data to get all
> tuples and find the one with our TID :)

We can't rely on that, because we have no guarantee we can find the
tuple quickly enough. Equality-based indexing is very much optional,
and so are TID-based checks (outside the current vacuum-related APIs),
so finding one TID can (and probably will) take O(indexsize) when the
tuple is not in the index, which is one reason for ambulkdelete() to
exist.

Kind regards,

Matthias van de Meent

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-02-21 11:19  Matthias van de Meent <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Matthias van de Meent @ 2024-02-21 11:19 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

On Wed, 21 Feb 2024 at 09:35, Michail Nikolaev
<[email protected]> wrote:
>
> One more idea - is just forbid HOT prune while the second phase is
> running. It is not possible anyway currently because of snapshot held.
>
> Possible enhancements:
> * we may apply restriction only to particular tables
> * we may apply restrictions only to part of the tables (not yet
> scanned by R/CICs).
>
> Yes, it is not an elegant solution, limited, not reliable in terms of
> architecture, but a simple one.

How do you suppose this would work differently from a long-lived
normal snapshot, which is how it works right now?
Would it be exclusively for that relation? How would this be
integrated with e.g. heap_page_prune_opt?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)






^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-02-21 11:36  Michail Nikolaev <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 0 replies; 26+ messages in thread

From: Michail Nikolaev @ 2024-02-21 11:36 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

Hi!

> How do you suppose this would work differently from a long-lived
> normal snapshot, which is how it works right now?

Difference in the ability to take new visibility snapshot periodically
during the second phase with rechecking visibility of tuple according
to the "reference" snapshot (which is taken only once like now).
It is the approach from (1) but with a workaround for the issues
caused by heap_page_prune_opt.

> Would it be exclusively for that relation?
Yes, only for that affected relation. Other relations are unaffected.

> How would this be integrated with e.g. heap_page_prune_opt?
Probably by some flag in RelationData, but not sure here yet.

If the idea looks sane, I could try to extend my POC - it should be
not too hard, likely (I already have tests to make sure it is
correct).

(1): https://www.postgresql.org/message-id/flat/CANtu0oijWPRGRpaRR_OvT2R5YALzscvcOTFh-%3DuZKUpNJmuZtw%40m...

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-27 16:56  Antonin Houska <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  1 sibling, 2 replies; 26+ messages in thread

From: Antonin Houska @ 2025-11-27 16:56 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Michail Nikolaev <[email protected]> wrote:

> I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
> improvements) in some lighter way.

I haven't read the whole thread yet, but the effort to minimize the impact of
C/RIC on VACUUM seems to prevail. Following is one more proposal. The core
idea is that C/RIC should avoid indexing dead tuples, however snapshot is not
necessary to distinguish dead tuple from a live one. And w/o snapshot, the
backend executing C/RIC does not restrict VACUUM on other tables.

Concurrent (re)build of unique index appears to be another topic of this
thread, but I think this approach should handle the problem too. The workflow
is:

1. Create an empty index.

2. Wait until all transactions are aware of the index, so they take the new
   index into account when deciding on new HOT chains. (This is already
   implemented.)

3. Set the 'indisready' flag so the index is ready for insertions.

4. While other transactions can insert their tuples into the index now,
   process the table one page at a time this way:

   4.1 Acquire (shared) content lock on the buffer.

   4.3 Collect the root tuples of HOT chains - these and only these need to be
       inserted into the index.

   4.4 Unlock the buffer.

5. Once the whole table is processed, insert the collected tuples into the
   index.

   To avoid insertions of tuples that concurrent transactions have just
   inserted, we'd need something like index.c:validate_index() (i.e. insert
   into the index only the tuples that it does not contain yet), but w/o
   snapshot because we already have the heap tuples collected.

   Also it'd make sense to wait for completion of all the transactions that
   currently have the table locked for INSERT/UPDATE: some of these might have
   inserted their tuples into the heap, but not yet into the index. If we
   included some of those tuples into our collection and insert them into the
   index first, the other transactions could end up with ERROR when inserting
   those tuples again.

6. Set the 'indisvalid' flag so that the index can be used by queries.

Note on pruning: As we only deal with the root tuples of HOT chains (4.3),
page pruning triggered by queries (heap_page_prune_opt) should not be
disruptive. Actually C/RIC can do the pruning itself it it appears to be
useful. For example, if whole HOT chain should be considered DEAD by the next
VACUUM, pruning is likely (depending on the OldestXid) to remove it so that we
do not insert TID of the root tuple into the index unnecessarily.

I can even think of letting VACUUM run on the same table that C/RIC is
processing. In that case, interlocking would take place at page level: either
C/RIC or VACUUM can acquire lock for particular page, but not both. This would
be useful in cases C/RIC takes very long time.

In this case, C/RIC *must not* insert TIDs of dead tuples into the index at
all. Otherwise there could be race conditions such that VACUUM removes dead
tuples from the index and marks the corresponding heap items as UNUSED, but
C/RIC then re-inserts the index tuples.

To avoid this problem, C/RIC needs to re-check each TID before it inserts it
into the index and skip the insertion if the tuple (or the whole HOT-chain
starting at this tuple) it points to is DEAD according to the OldestXmin that
the most recent VACUUM used. (VACUUM could perhaps advertise its OldestXmin
for C/RIC via shared memory.)

Also, before this re-checking starts, it must be ensured that VACUUM does not
start again, until the index creation is complete: a new run of VACUUM implies
a new value of OldestXmin, i.e. need for more stringent re-checking of the
heap tuples.

Related question is which OldestXmin to use in the step 4.3. One option is to
use *exactly* the OldestXmin shared VACUUM. However that wouldn't work if
VACUUM starts while C/RIC is already in progress. (Which seems like a
significant restriction.)

Another option is to get the OldestXmin in the same way as VACUUM
does. However, the value can thus be different from the one used by VACUUM:
older if retrieved before VACUUM started and newer if retrieved while VACUUM
was already running. The first case can be handled by the heap tuple
re-checking (see above). The latter implies that, before setting 'indisvalid',
C/RIC has to wait until all snapshots have their xmin >= this (more recent)
OldestXmin. Otherwise some snapshots could miss data they should see.

(An implication of the rule that C/RIC must not insert TIDs of dead tuples
into the index is that VACUUM does not have to call the index AM bulk delete
while C/RIC is running for that index. This would be just an optimization.)

Of course, I could have missed some important point, so please explain why
this concept is broken :-) Or let me know if something needs to be explained
more in detail. Thanks.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-27 17:40  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  1 sibling, 1 reply; 26+ messages in thread

From: Mihail Nikalayeu @ 2025-11-27 17:40 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Hello, Antonin!

> I haven't read the whole thread yet, but the effort to minimize the impact of
> C/RIC on VACUUM seems to prevail
Yes, the thread is super long and probably you missed annotations to
most important emails in [0].

> Of course, I could have missed some important point, so please explain why
> this concept is broken :-) Or let me know if something needs to be explained
> more in detail. Thanks.

Looks like your idea is not broken, but... It is actually an almost
1-1 to idea used in the "full" version of the patch.
Explanations are available in [1] and [2].
In [3] I reduced the patch scope to find a solution compatible with REPACK.

Few comments:

> 1. Create an empty index.
Yes, patch does exactly the same, introducing special lightweight AM -
STIR (Short Term Index Replacement) to collect new tuples.

> 4.1 Acquire (shared) content lock on the buffer.
>  4.3 Collect the root tuples of HOT chains - these and only these need to be
       inserted into the index.
>   4.4 Unlock the buffer.

Instead of such technique essentially the same is used - it keeps the
snapshot to be used, it just rotates it every few pages for a fresh
one.
It solves some of the issues with selection of alive tuples you
mentioned without any additional logic.

> Concurrent (re)build of unique index appears to be another topic of this
> thread, but I think this approach should handle the problem too.
It is solved with a special commit in the original patchset.

You know, clever people think the same :)
Interesting fact, it is not the first time - at [4] Sergey also
proposed an idea of an "empty" index to collect tuples (which gives
the single scan).

So, it is very good knews the approach feels valid for multiple people
(also Mathias introduced the idea about "fresh snapshot"~"no snapshot"
initially).

One thing I am not happy about - it is not applicable to the REPACK case.

Best regards,
Mikhail.

[0]: https://commitfest.postgresql.org/patch/4971/
[1]: https://www.postgresql.org/message-id/[email protected]...
[2]: https://www.postgresql.org/message-id/[email protected]...
[3]: https://www.postgresql.org/message-id/[email protected]...
[4]: https://www.postgresql.org/message-id/flat/CAMAof6_FY0MrNJOuBrqvQqJKiwskFvjRtgpVHf-D7A%3DKvTtYXg%40m...

^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-27 18:22  Matthias van de Meent <[email protected]>
  parent: Antonin Houska <[email protected]>
  1 sibling, 1 reply; 26+ messages in thread

From: Matthias van de Meent @ 2025-11-27 18:22 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Michail Nikolaev <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

On Thu, 27 Nov 2025 at 17:56, Antonin Houska <[email protected]> wrote:
>
> Michail Nikolaev <[email protected]> wrote:
>
> > I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
> > improvements) in some lighter way.
>
> I haven't read the whole thread yet, but the effort to minimize the impact of
> C/RIC on VACUUM seems to prevail. Following is one more proposal. The core
> idea is that C/RIC should avoid indexing dead tuples, however snapshot is not
> necessary to distinguish dead tuple from a live one. And w/o snapshot, the
> backend executing C/RIC does not restrict VACUUM on other tables.
>
> Concurrent (re)build of unique index appears to be another topic of this
> thread, but I think this approach should handle the problem too. The workflow
> is:
>
> 1. Create an empty index.
>
> 2. Wait until all transactions are aware of the index, so they take the new
>    index into account when deciding on new HOT chains. (This is already
>    implemented.)
>
> 3. Set the 'indisready' flag so the index is ready for insertions.
>
> 4. While other transactions can insert their tuples into the index now,
>    process the table one page at a time this way:
>
>    4.1 Acquire (shared) content lock on the buffer.
>
>    4.3 Collect the root tuples of HOT chains - these and only these need to be
>        inserted into the index.
>
>    4.4 Unlock the buffer.


> 5. Once the whole table is processed, insert the collected tuples into the
>    index.
>
>    To avoid insertions of tuples that concurrent transactions have just
>    inserted, we'd need something like index.c:validate_index() (i.e. insert
>    into the index only the tuples that it does not contain yet), but w/o
>    snapshot because we already have the heap tuples collected.
>
>    Also it'd make sense to wait for completion of all the transactions that
>    currently have the table locked for INSERT/UPDATE: some of these might have
>    inserted their tuples into the heap, but not yet into the index. If we
>    included some of those tuples into our collection and insert them into the
>    index first, the other transactions could end up with ERROR when inserting
>    those tuples again.
>
> 6. Set the 'indisvalid' flag so that the index can be used by queries.
>
> Note on pruning: As we only deal with the root tuples of HOT chains (4.3),
> page pruning triggered by queries (heap_page_prune_opt) should not be
> disruptive. Actually C/RIC can do the pruning itself it it appears to be
> useful. For example, if whole HOT chain should be considered DEAD by the next
> VACUUM, pruning is likely (depending on the OldestXid) to remove it so that we
> do not insert TID of the root tuple into the index unnecessarily.
[...]
> Of course, I could have missed some important point, so please explain why
> this concept is broken :-) Or let me know if something needs to be explained
> more in detail. Thanks.

1. When do you select and insert tuples that aren't part of a hot
chain into the index, i.e. tuples that were never updated after they
got inserted into the table? Or is every tuple "part of a hot chain"
even if the tuple wasn't ever updated?

2. HOT chains can be created while the index wasn't yet present, and
thus the indexed attributes of the root tuples can be different from
the most current tuple of a chain. If you only gather root tuples, we
could index incorrect data for that HOT chain. The correct approach
here is to index only the visible tuples, as those won't have been
updated in a non-HOT manner without all indexed attributes being
unchanged.

3. Having the index marked indisready before it contains any data is
going to slow down the indexing process significantly:
a. The main index build now must go through shared memory and buffer
locking, instead of being able to use backend-local memory
b. The tuple-wise insertion path (IndexAmRoutine->aminsert) can have a
significantly higher overhead than the bulk insertion logic in
ambuild(); in metrics of WAL, pages accessed (IO), and CPU cycles
spent.

So, I don't think moving away from ambuild() as basis for initially
building the index this is such a great idea.

(However, I do think that having an _option_ to build the index using
ambuildempty()+aminsert() instead of ambuild() might be useful, if
only to more easily compare "natural grown" indexes vs freshly built
ones, but that's completely orthogonal to CIC snapshotting
improvements.)


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)





^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-27 18:57  Mihail Nikalayeu <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 0 replies; 26+ messages in thread

From: Mihail Nikalayeu @ 2025-11-27 18:57 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Hi, Antonin!

On Thu, Nov 27, 2025 at 6:40 PM Mihail Nikalayeu
<[email protected]> wrote:
> > 1. Create an empty index.
> Yes, patch does exactly the same, introducing special lightweight AM -
> STIR (Short Term Index Replacement) to collect new tuples.

Initially understood incorrectly - in your solution you propose to use
a single index.
But STIR is used to collect new coming tuples, while the main index is
built using a batched way.

> To avoid insertions of tuples that concurrent transactions have just
> inserted, we'd need something like index.c:validate_index() (i.e. insert
> into the index only the tuples that it does not contain yet), but w/o
> snapshot because we already have the heap tuples collected.

And later main and STIR are merged.

Best regards,
Mikhail.





^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-28 09:05  Antonin Houska <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 26+ messages in thread

From: Antonin Houska @ 2025-11-28 09:05 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Michail Nikolaev <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

Matthias van de Meent <[email protected]> wrote:

> On Thu, 27 Nov 2025 at 17:56, Antonin Houska <[email protected]> wrote:
> >
> > Michail Nikolaev <[email protected]> wrote:
> >
> > > I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
> > > improvements) in some lighter way.
> >
> > I haven't read the whole thread yet, but the effort to minimize the impact of
> > C/RIC on VACUUM seems to prevail. Following is one more proposal. The core
> > idea is that C/RIC should avoid indexing dead tuples, however snapshot is not
> > necessary to distinguish dead tuple from a live one. And w/o snapshot, the
> > backend executing C/RIC does not restrict VACUUM on other tables.
> >
> > Concurrent (re)build of unique index appears to be another topic of this
> > thread, but I think this approach should handle the problem too. The workflow
> > is:
> >
> > 1. Create an empty index.
> >
> > 2. Wait until all transactions are aware of the index, so they take the new
> >    index into account when deciding on new HOT chains. (This is already
> >    implemented.)
> >
> > 3. Set the 'indisready' flag so the index is ready for insertions.
> >
> > 4. While other transactions can insert their tuples into the index now,
> >    process the table one page at a time this way:
> >
> >    4.1 Acquire (shared) content lock on the buffer.
> >
> >    4.3 Collect the root tuples of HOT chains - these and only these need to be
> >        inserted into the index.
> >
> >    4.4 Unlock the buffer.
> 
> 
> > 5. Once the whole table is processed, insert the collected tuples into the
> >    index.
> >
> >    To avoid insertions of tuples that concurrent transactions have just
> >    inserted, we'd need something like index.c:validate_index() (i.e. insert
> >    into the index only the tuples that it does not contain yet), but w/o
> >    snapshot because we already have the heap tuples collected.
> >
> >    Also it'd make sense to wait for completion of all the transactions that
> >    currently have the table locked for INSERT/UPDATE: some of these might have
> >    inserted their tuples into the heap, but not yet into the index. If we
> >    included some of those tuples into our collection and insert them into the
> >    index first, the other transactions could end up with ERROR when inserting
> >    those tuples again.
> >
> > 6. Set the 'indisvalid' flag so that the index can be used by queries.
> >
> > Note on pruning: As we only deal with the root tuples of HOT chains (4.3),
> > page pruning triggered by queries (heap_page_prune_opt) should not be
> > disruptive. Actually C/RIC can do the pruning itself it it appears to be
> > useful. For example, if whole HOT chain should be considered DEAD by the next
> > VACUUM, pruning is likely (depending on the OldestXid) to remove it so that we
> > do not insert TID of the root tuple into the index unnecessarily.
> [...]
> > Of course, I could have missed some important point, so please explain why
> > this concept is broken :-) Or let me know if something needs to be explained
> > more in detail. Thanks.
> 
> 1. When do you select and insert tuples that aren't part of a hot
> chain into the index, i.e. tuples that were never updated after they
> got inserted into the table? Or is every tuple "part of a hot chain"
> even if the tuple wasn't ever updated?

Right, I considered "standalone tuple" a HOT chain of length 1. So it'll be
picked too.

> 2. HOT chains can be created while the index wasn't yet present, and
> thus the indexed attributes of the root tuples can be different from
> the most current tuple of a chain. If you only gather root tuples, we
> could index incorrect data for that HOT chain. The correct approach
> here is to index only the visible tuples, as those won't have been
> updated in a non-HOT manner without all indexed attributes being
> unchanged.

Good point.

> 3. Having the index marked indisready before it contains any data is
> going to slow down the indexing process significantly:
> a. The main index build now must go through shared memory and buffer
> locking, instead of being able to use backend-local memory
> b. The tuple-wise insertion path (IndexAmRoutine->aminsert) can have a
> significantly higher overhead than the bulk insertion logic in
> ambuild(); in metrics of WAL, pages accessed (IO), and CPU cycles
> spent.
> 
> So, I don't think moving away from ambuild() as basis for initially
> building the index this is such a great idea.
> 
> (However, I do think that having an _option_ to build the index using
> ambuildempty()+aminsert() instead of ambuild() might be useful, if
> only to more easily compare "natural grown" indexes vs freshly built
> ones, but that's completely orthogonal to CIC snapshotting
> improvements.)

The retail insertions are not something this proposal depends on. I think it'd
be possible to build a separate index locally and then "merge" it with the
regular one. I just tried to propose a solution that does not need snapshots.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 26+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-28 09:51  Matthias van de Meent <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 0 replies; 26+ messages in thread

From: Matthias van de Meent @ 2025-11-28 09:51 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Michail Nikolaev <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

On Fri, 28 Nov 2025 at 10:05, Antonin Houska <[email protected]> wrote:
> Matthias van de Meent <[email protected]> wrote:
> > 3. Having the index marked indisready before it contains any data is
> > going to slow down the indexing process significantly:
> > a. The main index build now must go through shared memory and buffer
> > locking, instead of being able to use backend-local memory
> > b. The tuple-wise insertion path (IndexAmRoutine->aminsert) can have a
> > significantly higher overhead than the bulk insertion logic in
> > ambuild(); in metrics of WAL, pages accessed (IO), and CPU cycles
> > spent.
> >
> > So, I don't think moving away from ambuild() as basis for initially
> > building the index this is such a great idea.
> >
> > (However, I do think that having an _option_ to build the index using
> > ambuildempty()+aminsert() instead of ambuild() might be useful, if
> > only to more easily compare "natural grown" indexes vs freshly built
> > ones, but that's completely orthogonal to CIC snapshotting
> > improvements.)
>
> The retail insertions are not something this proposal depends on. I think it'd
> be possible to build a separate index locally and then "merge" it with the
> regular one. I just tried to propose a solution that does not need snapshots.

I'm not sure we can generalize indexes to the point where merging two
built indexes is always both possible and efficient.

For example, the ANN indexes of pgvector (both HNSW and IVF) could
possibly have merge operations between indexes of the same type and
schema, but it would require a lot of effort on the side of the AM to
support merging; there is no trivial merge operation that also retains
the quality of the index without going through the aminsert() path.
Conversely, the current approach to CIC doesn't require additional
work on the index AM's side, and that's a huge enabler for every kind
of index.


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)





^ permalink  raw  reply  [nested|flat] 26+ messages in thread

end of thread, other threads:[~2025-11-28 09:51 UTC | newest]

Thread overview: 26+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2023-12-15 19:07 Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
2023-12-15 21:11 ` Matthias van de Meent <[email protected]>
2023-12-17 20:14   ` Michail Nikolaev <[email protected]>
2023-12-17 23:53     ` Matthias van de Meent <[email protected]>
2023-12-20 09:56       ` Michail Nikolaev <[email protected]>
2023-12-20 11:14         ` Matthias van de Meent <[email protected]>
2023-12-20 16:18           ` Michail Nikolaev <[email protected]>
2023-12-20 16:53             ` Michail Nikolaev <[email protected]>
2023-12-21 13:14               ` Michail Nikolaev <[email protected]>
2023-12-25 14:12                 ` Michail Nikolaev <[email protected]>
2024-01-04 11:24                   ` Matthias van de Meent <[email protected]>
2024-01-04 12:45                     ` Michail Nikolaev <[email protected]>
2024-01-09 17:00                       ` Michail Nikolaev <[email protected]>
2024-02-01 16:06                         ` Michail Nikolaev <[email protected]>
2024-02-17 21:48                           ` Matthias van de Meent <[email protected]>
2024-02-20 23:33                             ` Michail Nikolaev <[email protected]>
2024-02-21 08:35                               ` Michail Nikolaev <[email protected]>
2024-02-21 11:19                                 ` Matthias van de Meent <[email protected]>
2024-02-21 11:36                                   ` Michail Nikolaev <[email protected]>
2024-02-21 11:01                               ` Matthias van de Meent <[email protected]>
2025-11-27 16:56 ` Antonin Houska <[email protected]>
2025-11-27 17:40   ` Mihail Nikalayeu <[email protected]>
2025-11-27 18:57     ` Mihail Nikalayeu <[email protected]>
2025-11-27 18:22   ` Matthias van de Meent <[email protected]>
2025-11-28 09:05     ` Antonin Houska <[email protected]>
2025-11-28 09:51       ` Matthias van de Meent <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox