Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
5+ messages / 2 participants
[nested] [flat]

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-03-05 20:08  Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Matthias van de Meent @ 2024-03-05 20:08 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

On Wed, 21 Feb 2024 at 12:37, Michail Nikolaev
<[email protected]> wrote:
>
> Hi!
>
> > How do you suppose this would work differently from a long-lived
> > normal snapshot, which is how it works right now?
>
> Difference in the ability to take new visibility snapshot periodically
> during the second phase with rechecking visibility of tuple according
> to the "reference" snapshot (which is taken only once like now).
> It is the approach from (1) but with a workaround for the issues
> caused by heap_page_prune_opt.
>
> > Would it be exclusively for that relation?
> Yes, only for that affected relation. Other relations are unaffected.

I suppose this could work. We'd also need to be very sure that the
toast relation isn't cleaned up either: Even though that's currently
DELETE+INSERT only and can't apply HOT, it would be an issue if we
couldn't find the TOAST data of a deleted for everyone (but visible to
us) tuple.

Note that disabling cleanup for a relation will also disable cleanup
of tuple versions in that table that are not used for the R/CIC
snapshots, and that'd be an issue, too.

> > How would this be integrated with e.g. heap_page_prune_opt?
> Probably by some flag in RelationData, but not sure here yet.
>
> If the idea looks sane, I could try to extend my POC - it should be
> not too hard, likely (I already have tests to make sure it is
> correct).

I'm not a fan of this approach. Changing visibility and cleanup
semantics to only benefit R/CIC sounds like a pain to work with in
essentially all visibility-related code. I'd much rather have to deal
with another index AM, even if it takes more time: the changes in
semantics will be limited to a new plug in the index AM system and a
behaviour change in R/CIC, rather than behaviour that changes in all
visibility-checking code.

But regardless of second scan snapshots, I think we can worry about
that part at a later moment: The first scan phase is usually the most
expensive and takes the most time of all phases that hold snapshots,
and in the above discussion we agreed that we can already reduce the
time that a snapshot is held during that phase significantly. Sure, it
isn't great that we have to scan the table again with only a single
snapshot, but generally phase 2 doesn't have that much to do (except
when BRIN indexes are involved) so this is likely less of an issue.
And even if it is, we would still have reduced the number of
long-lived snapshots by half.

-Matthias

^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-03-07 18:36  Michail Nikolaev <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Michail Nikolaev @ 2024-03-07 18:36 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

Hello!

> I'm not a fan of this approach. Changing visibility and cleanup
> semantics to only benefit R/CIC sounds like a pain to work with in
> essentially all visibility-related code. I'd much rather have to deal
> with another index AM, even if it takes more time: the changes in
> semantics will be limited to a new plug in the index AM system and a
> behaviour change in R/CIC, rather than behaviour that changes in all
> visibility-checking code.

Technically, this does not affect the visibility logic, only the
clearing semantics.
All visibility related code remains untouched.
But yes, still an inelegant and a little strange-looking option.

At the same time, perhaps it can be dressed in luxury
somehow - for example, add as a first class citizen in ComputeXidHorizonsResult
a list of blocks to clear some relations.

> But regardless of second scan snapshots, I think we can worry about
> that part at a later moment: The first scan phase is usually the most
> expensive and takes the most time of all phases that hold snapshots,
> and in the above discussion we agreed that we can already reduce the
> time that a snapshot is held during that phase significantly. Sure, it
> isn't great that we have to scan the table again with only a single
> snapshot, but generally phase 2 doesn't have that much to do (except
> when BRIN indexes are involved) so this is likely less of an issue.
> And even if it is, we would still have reduced the number of
> long-lived snapshots by half.

Hmm, but it looks like we don't have the infrastructure to "update" xmin
propagating to the horizon after the first snapshot in a transaction is taken.

One option I know of is to reuse the
d9d076222f5b94a85e0e318339cfc44b8f26022d (1) approach.
But if this is the case, then there is no point in re-taking the
snapshot again during the first
phase - just apply this "if" only for the first phase - and you're done.

Do you know any less-hacky way? Or is it a nice way to go?

[1]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173b...

^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-03-12 11:50  Matthias van de Meent <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Matthias van de Meent @ 2024-03-12 11:50 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

On Thu, 7 Mar 2024 at 19:37, Michail Nikolaev
<[email protected]> wrote:
>
> Hello!
>
> > I'm not a fan of this approach. Changing visibility and cleanup
> > semantics to only benefit R/CIC sounds like a pain to work with in
> > essentially all visibility-related code. I'd much rather have to deal
> > with another index AM, even if it takes more time: the changes in
> > semantics will be limited to a new plug in the index AM system and a
> > behaviour change in R/CIC, rather than behaviour that changes in all
> > visibility-checking code.
>
> Technically, this does not affect the visibility logic, only the
> clearing semantics.
> All visibility related code remains untouched.

Yeah, correct. But it still needs to update the table relations'
information after finishing creating the indexes, which I'd rather not
have to do.

> But yes, still an inelegant and a little strange-looking option.
>
> At the same time, perhaps it can be dressed in luxury
> somehow - for example, add as a first class citizen in ComputeXidHorizonsResult
> a list of blocks to clear some relations.

Not sure what you mean here, but I don't think
ComputeXidHorizonsResult should have anything to do with actual
relations.

> > But regardless of second scan snapshots, I think we can worry about
> > that part at a later moment: The first scan phase is usually the most
> > expensive and takes the most time of all phases that hold snapshots,
> > and in the above discussion we agreed that we can already reduce the
> > time that a snapshot is held during that phase significantly. Sure, it
> > isn't great that we have to scan the table again with only a single
> > snapshot, but generally phase 2 doesn't have that much to do (except
> > when BRIN indexes are involved) so this is likely less of an issue.
> > And even if it is, we would still have reduced the number of
> > long-lived snapshots by half.
>
> Hmm, but it looks like we don't have the infrastructure to "update" xmin
> propagating to the horizon after the first snapshot in a transaction is taken.

We can just release the current snapshot, and get a new one, right? I
mean, we don't actually use the transaction for much else than
visibility during the first scan, and I don't think there is a need
for an actual transaction ID until we're ready to mark the index entry
with indisready.

> One option I know of is to reuse the
> d9d076222f5b94a85e0e318339cfc44b8f26022d (1) approach.
> But if this is the case, then there is no point in re-taking the
> snapshot again during the first
> phase - just apply this "if" only for the first phase - and you're done.

Not a fan of that, as it is too sensitive to abuse. Note that
extensions will also have access to these tools, and I think we should
build a system here that's not easy to break, rather than one that is.

> Do you know any less-hacky way? Or is it a nice way to go?

I suppose we could be resetting the snapshot every so often? Or use
multiple successive TID range scans with a new snapshot each?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)






^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-05-04 15:51  Michail Nikolaev <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Michail Nikolaev @ 2024-05-04 15:51 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers

Hello, Matthias!

> We can just release the current snapshot, and get a new one, right? I
> mean, we don't actually use the transaction for much else than
> visibility during the first scan, and I don't think there is a need
> for an actual transaction ID until we're ready to mark the index entry
> with indisready.

> I suppose we could be resetting the snapshot every so often? Or use
> multiple successive TID range scans with a new snapshot each?

It seems like it is not so easy in that case. Because we still need to hold
catalog snapshot xmin, releasing the snapshot which used for the scan does
not affect xmin propagated to the horizon.
That's why d9d076222f5b94a85e0e318339cfc44b8f26022d(1) affects only the
data horizon, but not the catalog's one.

So, in such a situation, we may:

1) starts scan from scratch with some TID range multiple times. But such an
approach feels too complex and error-prone for me.

2) split horizons propagated by `MyProc` to data-related xmin and
catalog-related xmin. Like `xmin` and `catalogXmin`. We may just mark
snapshots as affecting some of the horizons, or both. Such a change feels
easy to be done but touches pretty core logic, so we need someone's
approval for such a proposal, probably.

3) provide some less invasive (but less non-kludge) way: add some kind of
process flag like `PROC_IN_SAFE_IC_XMIN` and function like
`AdvanceIndexSafeXmin` which changes the way backend affect horizon
calculation. In the case of `PROC_IN_SAFE_IC_XMIN` `ComputeXidHorizons`
uses value from `proc->safeIcXmin` which is updated by
`AdvanceIndexSafeXmin` while switching scan snapshots.

So, with option 2 or 3, we may avoid holding data horizon during the first
phase scan by resetting the scan snapshot every so often (and, optionally,
using `AdvanceIndexSafeXmin` in case of 3rd approach).

The same will be possible for the second phase (validate).

We may do the same "resetting the snapshot every so often" technique, but
there is still the issue with the way we distinguish tuples which were
missed by the first phase scan or were inserted into the index after the
visibility snapshot was taken.

So, I see two options here:

1) approach with additional index with some custom AM proposed by you.

   It looks correct and reliable but feels complex to implement and
maintain. Also, it negatively affects performance of table access (because
of an additional index) and validation scan (because we need to merge
additional index content with visibility snapshot).

2) one more tricky approach.

We may add some boolean flag to `Relation` about information of index
building in progress (`indexisbuilding`).

It may be easily calculated using `(index->indisready &&
!index->indisvalid)`. For a more reliable solution, we also need to somehow
check if backend/transaction building the index still in progress. Also, it
is better to check if index is building concurrently using the "safe_index"
way.

I think there is a non too complex and expensive way to do so, probably by
addition of some flag to index catalog record.

Once we have such a flag, we may "legally" prohibit `heap_page_prune_opt`
affecting the relation updating `GlobalVisHorizonKindForRel` like this:

   if (rel != NULL && rel->rd_indexvalid && rel->rd_indexisbuilding)
           return VISHORIZON_CATALOG;

So, in common it works this way:

* backend building the index affects catalog horizon as usual, but data
horizon is regularly propagated forward during the scan. So, other
relations are processed by vacuum and `heap_page_prune_opt` without any
restrictions

* but our relation (with CIC in progress) accessed by `heap_page_prune_opt`
(or any other vacuum-like mechanics) with catalog horizon to honor CIC
work. Therefore, validating scan may be sure what none of the HOT-chain
will be truncated. Even regular vacuum can't affect it (but yes, it can't
be anyway because of relation locking).

As a result, we may easily distinguish tuples missed by first phase scan,
just by testing them against reference snapshot (which used to take
visibility snapshot).

So, for me, this approach feels non-kludge enough, safe and effective and
the same time.

I have a prototype of this approach and looks like it works (I have a good
test catching issues with index content for CIC).

What do you think about all this?

[1]:
https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173b...

^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-05-05 23:37  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: Michail Nikolaev @ 2024-05-05 23:37 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>

Hello, Matthias!

I just realized there is a much simpler and safe way to deal with the
problem.

So, d9d076222f5b94a85e0e318339cfc44b8f26022d(1) had a bug because the scan
was not protected by a snapshot. At the same time, we want this snapshot to
affect not all the relations, but only a subset of them. And there is
already a proper way to achieve that - different types of visibility
horizons!

So, to resolve the issue, we just need to create a separated horizon value
for such situation as building an index concurrently.

For now, let's name it `VISHORIZON_BUILD_INDEX_CONCURRENTLY` for example.
By default, its value is equal to `VISHORIZON_DATA`. But in some cases it
"stops" propagating forward while concurrent index is building, like this:

            h->create_index_concurrently_oldest_nonremovable
=TransactionIdOlder(h->create_index_concurrently_oldest_nonremovable, xmin);
            if (!(statusFlags & PROC_IN_SAFE_IC))
                        h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, xmin);

The `PROC_IN_SAFE_IC` marks backend xmin as ignored by `VISHORIZON_DATA`
but not by `VISHORIZON_BUILD_INDEX_CONCURRENTLY`.

After, we need to use appropriate horizon for relations which are processed
by `PROC_IN_SAFE_IC` backends. There are a few ways to do it, we may start
prototyping with `rd_indexisbuilding` from previous message:

        static inline GlobalVisHorizonKind
        GlobalVisHorizonKindForRel(Relation rel)
        ........
            if (rel != NULL && rel->rd_indexvalid &&
rel->rd_indexisbuilding)
                    return VISHORIZON_BUILD_INDEX_CONCURRENTLY;


There are few more moments need to be considered:

* Does it move the horizon backwards?

It is allowed for the horizon to move backwards (like said in
`ComputeXidHorizons`) but anyway - in that case the horizon for particular
relations just starts to lag behind the horizon for other relations.
Invariant is like that: `VISHORIZON_BUILD_INDEX_CONCURRENTLY` <=
`VISHORIZON_DATA` <= `VISHORIZON_CATALOG` <= `VISHORIZON_SHARED`.

* What is about old cached versions of `Relation` objects without
`rd_indexisbuilding` yet set?

This is not a problem because once the backend registers a new index, it
waits for all transactions without that knowledge to end
(`WaitForLockers`). So, new ones will also get information about new
horizon for that particular relation.

* What is about TOAST?
To keep TOAST horizon aligned with relation building the index, we may do
the next thing (as first implementation iteration):

        else if (rel != NULL && ((rel->rd_indexvalid &&
rel->rd_indexisbuilding) || IsToastRelation(rel)))
                return VISHORIZON_BUILD_INDEX_CONCURRENTLY;

For the normal case, `VISHORIZON_BUILD_INDEX_CONCURRENTLY` is equal to
`VISHORIZON_DATA` - nothing is changed at all. But while the concurrent
index is building, the TOAST horizon is guaranteed to be aligned with its
parent relation. And yes, it is better to find an easy way to affect only
TOAST relations related to the relation with index building in progress.

New horizon adds some complexity, but not too much, in my opinion. I am
pretty sure it is worth being done because the ability to rebuild indexes
without performance degradation is an extremely useful feature.
Things to be improved:
* better way to track relations with concurrent indexes being built (with
mechanics to understood what index build was failed)
* better way to affect TOAST tables only related to concurrent index build
* better naming

Patch prototype in attachment.
Also, maybe it is worth committing test separately - it was based on Andrey
Borodin work (2). The test fails well in the case of incorrect
implementation.

[1]:
https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173b...
[2]: https://github.com/x4m/postgres_g/commit/d0651e7d0d14862d5a4dac076355


Attachments:

  [text/x-patch] v1-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch (17.5K, 3-v1-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch)
  download | inline diff:
From b463dd180ab5820ac5b4144ff22622b2a6340b09 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 6 May 2024 01:08:49 +0200
Subject: [PATCH v1] WIP: fix d9d076222f5b "VACUUM: ignore indexing operations
 with CONCURRENTLY" which was reverted by e28bb8851969.

Introduce new type of visibility horizon to be used for relation with concurrently build indexes (in the case of "safe" index).
---
 src/backend/catalog/index.c              |   3 +
 src/backend/storage/ipc/procarray.c      |  72 ++++++++++-
 src/backend/utils/cache/relcache.c       |   6 +
 src/bin/pg_amcheck/t/006_concurrently.pl | 155 +++++++++++++++++++++++
 src/include/utils/rel.h                  |   1 +
 5 files changed, 231 insertions(+), 6 deletions(-)
 create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a8568c55c..6ad9254d49 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3320,6 +3320,9 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 
 	/* Open and lock the parent heap relation */
 	heapRelation = table_open(heapId, ShareUpdateExclusiveLock);
+	/* Load information about the building indexes */
+	RelationGetIndexList(heapRelation);
+	Assert(heapRelation->rd_indexisbuilding);
 
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 1a83c4220b..a2fe173bb6 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -236,6 +236,12 @@ typedef struct ComputeXidHorizonsResult
 	 */
 	TransactionId data_oldest_nonremovable;
 
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in normal user
+	 * defined tables with index building in progress.
+	 */
+	TransactionId create_index_concurrently_oldest_nonremovable;
+
 	/*
 	 * Oldest xid for which deleted tuples need to be retained in this
 	 * session's temporary tables.
@@ -251,6 +257,7 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_SHARED,
 	VISHORIZON_CATALOG,
 	VISHORIZON_DATA,
+	VISHORIZON_BUILD_INDEX_CONCURRENTLY,
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
@@ -297,6 +304,7 @@ static TransactionId standbySnapshotPendingXmin;
 static GlobalVisState GlobalVisSharedRels;
 static GlobalVisState GlobalVisCatalogRels;
 static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisBuildIndexConcurrentlyRels;
 static GlobalVisState GlobalVisTempRels;
 
 /*
@@ -1727,9 +1735,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	bool		in_recovery = RecoveryInProgress();
 	TransactionId *other_xids = ProcGlobal->xids;
 
-	/* inferred after ProcArrayLock is released */
-	h->catalog_oldest_nonremovable = InvalidTransactionId;
-
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	h->latest_completed = TransamVariables->latestCompletedXid;
@@ -1749,7 +1754,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 
 		h->oldest_considered_running = initial;
 		h->shared_oldest_nonremovable = initial;
+		h->catalog_oldest_nonremovable = initial;
 		h->data_oldest_nonremovable = initial;
+		h->create_index_concurrently_oldest_nonremovable = initial;
 
 		/*
 		 * Only modifications made by this backend affect the horizon for
@@ -1847,11 +1854,28 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 			(statusFlags & PROC_AFFECTS_ALL_HORIZONS) ||
 			in_recovery)
 		{
-			h->data_oldest_nonremovable =
-				TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+			h->create_index_concurrently_oldest_nonremovable =
+					TransactionIdOlder(h->create_index_concurrently_oldest_nonremovable, xmin);
+
+			if (!(statusFlags & PROC_IN_SAFE_IC))
+				h->data_oldest_nonremovable =
+					TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+
+			/* Catalog tables need to consider all backends in this db */
+			h->catalog_oldest_nonremovable =
+				TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
+
 		}
 	}
 
+	/* catalog horizon should never be later than data */
+	Assert(TransactionIdPrecedesOrEquals(h->catalog_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+
+	/* data horizon should never be later than index building horizon */
+	Assert(TransactionIdPrecedesOrEquals(h->create_index_concurrently_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+
 	/*
 	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
 	 * after lock is released.
@@ -1873,6 +1897,10 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
 		h->data_oldest_nonremovable =
 			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+		h->create_index_concurrently_oldest_nonremovable =
+				TransactionIdOlder(h->create_index_concurrently_oldest_nonremovable, kaxmin);
+		h->catalog_oldest_nonremovable =
+			TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
 		/* temp relations cannot be accessed in recovery */
 	}
 
@@ -1880,6 +1908,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 										 h->shared_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->create_index_concurrently_oldest_nonremovable));
 
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
@@ -1888,6 +1918,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
 	h->data_oldest_nonremovable =
 		TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
+	h->create_index_concurrently_oldest_nonremovable =
+			TransactionIdOlder(h->create_index_concurrently_oldest_nonremovable, h->slot_xmin);
 
 	/*
 	 * The only difference between catalog / data horizons is that the slot's
@@ -1900,7 +1932,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	h->shared_oldest_nonremovable =
 		TransactionIdOlder(h->shared_oldest_nonremovable,
 						   h->slot_catalog_xmin);
-	h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+	h->catalog_oldest_nonremovable =
+		TransactionIdOlder(h->catalog_oldest_nonremovable,
+						   h->slot_xmin);
 	h->catalog_oldest_nonremovable =
 		TransactionIdOlder(h->catalog_oldest_nonremovable,
 						   h->slot_catalog_xmin);
@@ -1918,6 +1952,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	h->oldest_considered_running =
 		TransactionIdOlder(h->oldest_considered_running,
 						   h->data_oldest_nonremovable);
+	h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running,
+							   h->create_index_concurrently_oldest_nonremovable);
 
 	/*
 	 * shared horizons have to be at least as old as the oldest visible in
@@ -1925,6 +1962,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	 */
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->create_index_concurrently_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->catalog_oldest_nonremovable));
 
@@ -1938,6 +1977,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 										 h->catalog_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->create_index_concurrently_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
 										 h->temp_oldest_nonremovable));
 	Assert(!TransactionIdIsValid(h->slot_xmin) ||
@@ -1972,6 +2013,8 @@ GlobalVisHorizonKindForRel(Relation rel)
 	else if (IsCatalogRelation(rel) ||
 			 RelationIsAccessibleInLogicalDecoding(rel))
 		return VISHORIZON_CATALOG;
+	else if (rel != NULL && ((rel->rd_indexvalid && rel->rd_indexisbuilding) || IsToastRelation(rel)))
+		return VISHORIZON_BUILD_INDEX_CONCURRENTLY;
 	else if (!RELATION_IS_LOCAL(rel))
 		return VISHORIZON_DATA;
 	else
@@ -2004,6 +2047,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
 			return horizons.catalog_oldest_nonremovable;
 		case VISHORIZON_DATA:
 			return horizons.data_oldest_nonremovable;
+		case VISHORIZON_BUILD_INDEX_CONCURRENTLY:
+			return horizons.create_index_concurrently_oldest_nonremovable;
 		case VISHORIZON_TEMP:
 			return horizons.temp_oldest_nonremovable;
 	}
@@ -2454,6 +2499,9 @@ GetSnapshotData(Snapshot snapshot)
 		GlobalVisDataRels.definitely_needed =
 			FullTransactionIdNewer(def_vis_fxid_data,
 								   GlobalVisDataRels.definitely_needed);
+		GlobalVisBuildIndexConcurrentlyRels.definitely_needed =
+				FullTransactionIdNewer(def_vis_fxid_data,
+									   GlobalVisBuildIndexConcurrentlyRels.definitely_needed);
 		/* See temp_oldest_nonremovable computation in ComputeXidHorizons() */
 		if (TransactionIdIsNormal(myxid))
 			GlobalVisTempRels.definitely_needed =
@@ -2478,6 +2526,9 @@ GetSnapshotData(Snapshot snapshot)
 		GlobalVisCatalogRels.maybe_needed =
 			FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
 								   oldestfxid);
+		GlobalVisBuildIndexConcurrentlyRels.maybe_needed =
+				FullTransactionIdNewer(GlobalVisBuildIndexConcurrentlyRels.maybe_needed,
+									   oldestfxid);
 		GlobalVisDataRels.maybe_needed =
 			FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
 								   oldestfxid);
@@ -4106,6 +4157,9 @@ GlobalVisTestFor(Relation rel)
 		case VISHORIZON_DATA:
 			state = &GlobalVisDataRels;
 			break;
+		case VISHORIZON_BUILD_INDEX_CONCURRENTLY:
+			state = &GlobalVisBuildIndexConcurrentlyRels;
+			break;
 		case VISHORIZON_TEMP:
 			state = &GlobalVisTempRels;
 			break;
@@ -4158,6 +4212,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
 	GlobalVisDataRels.maybe_needed =
 		FullXidRelativeTo(horizons->latest_completed,
 						  horizons->data_oldest_nonremovable);
+	GlobalVisBuildIndexConcurrentlyRels.maybe_needed =
+			FullXidRelativeTo(horizons->latest_completed,
+							  horizons->create_index_concurrently_oldest_nonremovable);
 	GlobalVisTempRels.maybe_needed =
 		FullXidRelativeTo(horizons->latest_completed,
 						  horizons->temp_oldest_nonremovable);
@@ -4176,6 +4233,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
 	GlobalVisDataRels.definitely_needed =
 		FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
 							   GlobalVisDataRels.definitely_needed);
+	GlobalVisBuildIndexConcurrentlyRels.definitely_needed =
+			FullTransactionIdNewer(GlobalVisBuildIndexConcurrentlyRels.maybe_needed,
+								   GlobalVisBuildIndexConcurrentlyRels.definitely_needed);
 	GlobalVisTempRels.definitely_needed = GlobalVisTempRels.maybe_needed;
 
 	ComputeXidHorizonsResultLastXmin = RecentXmin;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 262c9878dd..677ba61205 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4769,6 +4769,7 @@ RelationGetIndexList(Relation relation)
 	Oid			pkeyIndex = InvalidOid;
 	Oid			candidateIndex = InvalidOid;
 	bool		pkdeferrable = false;
+	bool 		indexisbuilding = false;
 	MemoryContext oldcxt;
 
 	/* Quick exit if we already computed the list. */
@@ -4809,6 +4810,10 @@ RelationGetIndexList(Relation relation)
 		/* add index's OID to result list */
 		result = lappend_oid(result, index->indexrelid);
 
+		/* consider index as building if it is ready but not yet valid */
+		if (index->indisready && !index->indisvalid)
+			indexisbuilding = true;
+
 		/*
 		 * Non-unique or predicate indexes aren't interesting for either oid
 		 * indexes or replication identity indexes, so don't check them.
@@ -4869,6 +4874,7 @@ RelationGetIndexList(Relation relation)
 	relation->rd_indexlist = list_copy(result);
 	relation->rd_pkindex = pkeyIndex;
 	relation->rd_ispkdeferrable = pkdeferrable;
+	relation->rd_indexisbuilding = indexisbuilding;
 	if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex) && !pkdeferrable)
 		relation->rd_replidindex = pkeyIndex;
 	else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 0000000000..7b8afeead5
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0,c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	$node->pgbench(
+		'--no-vacuum --client=5 --transactions=25000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			'002_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			# Ensure some HOT updates happen
+			'002_pgbench_concurrent_transaction_updates' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  )
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	sleep(1);
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr);
+		while (1)
+		{
+
+			($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+			is($result, '0', 'REINDEX is correct');
+
+			($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', true, true);));
+			is($result, '0', 'bt_index_check is correct');
+ 			if ($result)
+ 			{
+				diag($stderr);
+ 			}
+
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+        $psql{stdin} .= "\\q\n";
+        $psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+
+	$child->finalize();
+	$child->summary();
+	$node->stop;
+	done_testing();
+}
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..a9e2d1beab 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -152,6 +152,7 @@ typedef struct RelationData
 	List	   *rd_indexlist;	/* list of OIDs of indexes on relation */
 	Oid			rd_pkindex;		/* OID of (deferrable?) primary key, if any */
 	bool		rd_ispkdeferrable;	/* is rd_pkindex a deferrable PK? */
+	bool		rd_indexisbuilding; /* is index building in progress for relation */
 	Oid			rd_replidindex; /* OID of replica identity index, if any */
 
 	/* data managed by RelationGetStatExtList: */
-- 
2.34.1



^ permalink  raw  reply  [nested|flat] 5+ messages in thread

end of thread, other threads:[~2024-05-05 23:37 UTC | newest]

Thread overview: 5+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-03-05 20:08 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
2024-03-07 18:36 ` Michail Nikolaev <[email protected]>
2024-03-12 11:50   ` Matthias van de Meent <[email protected]>
2024-05-04 15:51     ` Michail Nikolaev <[email protected]>
2024-05-05 23:37       ` Michail Nikolaev <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox